Artificial intelligence (AI) architecture design

Article • 08/31/2023

Artificial intelligence (AI) is the capability of a computer to imitate intelligent human
behavior. Through AI, machines can analyze images, comprehend speech, interact in
natural ways, and make predictions using data.

AI concepts

Algorithm
An algorithm is a sequence of calculations and rules used to solve a problem or analyze
a set of data. It is like a flow chart, with step-by-step instructions for questions to ask,
but written in math and programming code. An algorithm may describe how to
determine whether a pet is a cat, dog, fish, bird, or lizard. Another far more complicated
algorithm may describe how to identify a written or spoken language, analyze its words,
translate them into a different language, and then check the translation for accuracy.

Machine learning
Machine learning (ML) is an AI technique that uses mathematical algorithms to create
predictive models. An algorithm is used to parse data fields and to "learn" from that
data by using patterns found within it to generate models. Those models are then used
to make informed predictions or decisions about new data.

The predictive models are validated against known data, measured by performance
metrics selected for specific business scenarios, and then adjusted as needed. This
process of learning and validation is called training. Through periodic retraining, ML
models are improved over time.

What are the machine learning products at Microsoft?

Deep learning
Deep learning is a type of ML that can determine for itself whether its predictions are
accurate. It also uses algorithms to analyze data, but it does so on a larger scale than
ML.

Deep learning uses artificial neural networks, which consist of multiple layers of
algorithms. Each layer looks at the incoming data, performs its own specialized analysis,
and produces an output that other layers can understand. This output is then passed to
the next layer, where a different algorithm does its own analysis, and so on.

With many layers in each neural network, and sometimes using multiple neural
networks, a machine can learn through its own data processing. This requires much
more data and much more computing power than ML.
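
As an illustration of the layered idea, the following minimal sketch stacks a few fully
connected layers with PyTorch (one of the frameworks referenced later in this article).
The layer sizes and the five-class output are arbitrary choices for the example.

import torch
import torch.nn as nn

# Each layer transforms the output of the previous layer, so the analysis is
# performed in stages, as described above.
model = nn.Sequential(
    nn.Linear(64, 128),  # layer 1: raw input features -> hidden representation
    nn.ReLU(),
    nn.Linear(128, 32),  # layer 2: refines the previous layer's output
    nn.ReLU(),
    nn.Linear(32, 5),    # output layer: one score per candidate class
)

scores = model(torch.randn(1, 64))  # forward pass on a single example
print(scores.shape)                 # torch.Size([1, 5])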

Deep learning versus machine learning

Batch scoring of deep learning models on Azure

Bots
A bot is an automated software program designed to perform a particular task. Think of
it as a robot without a body. Early bots were comparatively simple, handling repetitive
and voluminous tasks with relatively straightforward algorithmic logic. An example
would be web crawlers used by search engines to automatically explore and catalog
web content.

Bots have become much more sophisticated, using AI and other technologies to mimic
human activity and decision-making, often while interacting directly with humans
through text or even speech. Examples include bots that can take a dinner reservation,
chatbots (or conversational AI) that help with customer service interactions, and social
bots that post breaking news or scientific data to social media sites.

Microsoft offers the Azure Bot Service, a managed service purpose-built for enterprise-
grade bot development.

About Azure Bot Service

Ten guidelines for responsible bots

Azure reference architecture: Enterprise-grade conversational bot

Autonomous systems
Autonomous systems are part of an evolving new class that goes beyond basic
automation. Instead of performing a specific task repeatedly with little or no variation
(like bots do), autonomous systems bring intelligence to machines so they can adapt to
changing environments to accomplish a desired goal.

Smart buildings use autonomous systems to automatically control operations like
lighting, ventilation, air conditioning, and security. A more sophisticated example would
be a self-directed robot exploring a collapsed mine shaft to thoroughly map its interior,
determine which portions are structurally sound, analyze the air for breathability, and
detect signs of trapped miners in need of rescue, all without a human monitoring it
remotely in real time.

Autonomous systems and solutions from Microsoft AI

General info on Microsoft AI


Learn more about Microsoft AI, and keep up-to-date with related news:

Microsoft AI School

Azure AI platform page

Microsoft AI platform page

Microsoft AI Blog

Microsoft AI on GitHub: Samples, reference architectures, and best practices

Azure Architecture Center


High-level architectural types

Prebuilt AI
Prebuilt AI is exactly what it sounds like: off-the-shelf AI models, services, and APIs that
are ready to use. These help you add intelligence to apps, websites, and flows without
having to gather data and then build, train, and publish your own models.

One example of prebuilt AI might be a pretrained model that can be incorporated as is
or used to provide a baseline for further custom training. Another example would be a
cloud-based API service that can be called at will to process natural language in a
desired fashion.

Azure Cognitive Services


Cognitive Services provide developers the opportunity to use prebuilt APIs and
integration toolkits to create applications that can see, hear, speak, understand, and
even begin to reason. The catalog of services within Cognitive Services can be
categorized into five main pillars: Vision, Speech, Language, Web Search, and
Decision/Recommendation.

Azure Cognitive Services documentation

Try Azure Cognitive Services for free

Choosing an Azure Cognitive Services technology

Choosing a natural language processing technology in Azure

Prebuilt AI models in AI Builder

AI Builder is a new capability in Microsoft Power Platform that provides a point-and-click
interface for adding AI to your apps, even if you have no coding or data science skills.
(Some features in AI Builder have not yet released for general availability and remain in
preview status. For more information, refer to the Feature availability by region page.)

You can build and train your own models, but AI Builder also provides select prebuilt AI
models that are ready for use right away. For example, you can add a component in
Microsoft Power Apps based on a prebuilt model that recognizes contact information
from business cards.

Power Apps on Azure


AI Builder documentation

AI model types in AI Builder

Overview of prebuilt AI models in AI Builder

Custom AI
Although prebuilt AI is useful (and increasingly flexible), the best way to get what you
need from AI is probably to build a system yourself. This is obviously a very deep and
complex subject, but let's look at some basic concepts beyond what we've just covered.

Code languages

The core concept of AI is the use of algorithms to analyze data and generate models to
describe (or score) it in ways that are useful. Algorithms are written by developers and
data scientists (and sometimes by other algorithms) using programming code. Two of
the most popular programming languages for AI development are currently Python and
R.

Python is a general-purpose, high-level programming language. It has a simple,
easy-to-learn syntax that emphasizes readability. There is no compiling step. Python has a
large standard library, but it also supports the ability to add modules and packages. This
encourages modularity and lets you expand capabilities when needed. There is a large
and growing ecosystem of AI and ML libraries for Python, including many that are
readily available in Azure.

Python on Azure product home page

Azure for Python developers

Azure Machine Learning SDK for Python

Introduction to machine learning with Python and Azure Notebooks

scikit-learn. An open-source ML library for Python

PyTorch. An open-source Python library with a rich ecosystem that can be used
for deep learning, computer vision, natural language processing, and more

TensorFlow. An open-source symbolic math library also used for ML applications
and neural networks

Tutorial: Apply machine learning models in Azure Functions with Python and
TensorFlow

R is a language and environment for statistical computing and graphics. It can be
used for everything from mapping broad social and marketing trends online to
developing financial and climate models.

Microsoft has fully embraced the R programming language and provides many different
options for R developers to run their code in Azure.

Use R interactively on Azure Machine Learning.

Tutorial: Create a logistic regression model in R with Azure Machine Learning

Training

Training is core to machine learning. It is the iterative process of "teaching" an algorithm
to create models, which are used to analyze data and then make accurate predictions
from it. In practice, this process has three general phases: training, validation, and
testing.

During the training phase, a quality set of known data is tagged so that individual fields
are identifiable. The tagged data is fed to an algorithm configured to make a particular
prediction. When finished, the algorithm outputs a model that describes the patterns it
found as a set of parameters. During validation, fresh data is tagged and used to test
the model. The algorithm is adjusted as needed and possibly put through more training.
Finally, the testing phase uses real-world data without any tags or preselected targets.
Assuming the model's results are accurate, it is considered ready for use and can be
deployed.
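
The following sketch illustrates the three phases with scikit-learn on a built-in dataset.
It is only an illustration of the train/validate/test split, not a prescribed Azure workflow.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # tagged data with known labels

# Hold out part of the data for validation and final testing.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)             # training phase
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))  # validation phase
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))      # final testing phase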

Train models with Azure Machine Learning

Hyperparameter tuning

Hyperparameters are data variables that govern the training process itself. They are
configuration variables that control how the algorithm operates. Hyperparameters are
thus typically set before model training begins and are not modified within the training
process in the way that parameters are. Hyperparameter tuning involves running trials
within the training task, assessing how well they are getting the job done, and then
adjusting as needed. This process generates multiple models, each trained using
different families of hyperparameters.
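
As a minimal, framework-agnostic illustration of the idea (not the Azure Machine
Learning tuning service itself), the sketch below tries several hyperparameter
combinations with scikit-learn's GridSearchCV and keeps the best-performing model.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Each combination of C and kernel is one trial; the search trains a model per
# combination, assesses it with cross-validation, and reports the best settings.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)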

Tune hyperparameters for your model with Azure Machine Learning


Model selection

The process of training and hyperparameter tuning produces numerous candidate
models. These can differ in many ways, including the effort needed to prepare the
data, the flexibility of the model, the amount of processing time, and of course the
accuracy of the results. Choosing the best trained model for your needs and
constraints is called model selection, but this is as much about preplanning before
training as it is about choosing the one that works best.

Automated machine learning (AutoML)

Automated machine learning, also known as AutoML, is the process of automating the
time-consuming, iterative tasks of machine learning model development. It can
significantly reduce the time it takes to get production-ready ML models. Automated
ML can assist with model selection, hyperparameter tuning, model training, and other
tasks, without requiring extensive programming or domain knowledge.
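
A minimal sketch of submitting an automated ML run with the Azure Machine Learning
Python SDK (v1) is shown below. It assumes an existing workspace with a config.json
file, a registered tabular dataset named customer-churn, and a label column named
churned; those names are placeholders for this example.

from azureml.core import Dataset, Experiment, Workspace
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()                               # assumes config.json for an existing workspace
training_data = Dataset.get_by_name(ws, "customer-churn")  # hypothetical registered dataset

automl_config = AutoMLConfig(
    task="classification",
    training_data=training_data,
    label_column_name="churned",   # hypothetical label column
    primary_metric="AUC_weighted",
    experiment_timeout_hours=0.5,
)

run = Experiment(ws, "automl-demo").submit(automl_config)
run.wait_for_completion(show_output=True)
best_run, fitted_model = run.get_output()  # automated ML selects the best model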

What is automated machine learning?

Scoring

Scoring is also called prediction and is the process of generating values based on a
trained machine learning model, given some new input data. The values, or scores, that
are created can represent predictions of future values, but they might also represent a
likely category or outcome. The scoring process can generate many different types of
values:

A list of recommended items and a similarity score

Numeric values, for time series models and regression models

A probability value, indicating the likelihood that a new input belongs to some
existing category

The name of a category or cluster to which a new item is most similar

A predicted class or outcome, for classification models

Batch scoring is when data is collected during some fixed period of time and then
processed in a batch. This might include generating business reports or analyzing
customer loyalty.

Real-time scoring is exactly that: scoring that is ongoing and performed as quickly as
possible. The classic example is credit card fraud detection, but real-time scoring can
also be used in speech recognition, medical diagnoses, market analyses, and many other
applications.
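
The sketch below contrasts the two modes with a generic, hypothetical model artifact
(model.pkl) and input file. It is only meant to show the shape of batch versus real-time
scoring, not a specific Azure deployment.

import joblib
import pandas as pd

model = joblib.load("model.pkl")  # hypothetical trained classification model

# Batch scoring: score a file of records collected over a fixed period.
batch = pd.read_csv("daily_transactions.csv")  # hypothetical input data
batch["fraud_probability"] = model.predict_proba(batch)[:, 1]
batch.to_csv("scored_transactions.csv", index=False)

# Real-time scoring: score each incoming record as soon as it arrives.
def score_one(record: dict) -> float:
    return float(model.predict_proba(pd.DataFrame([record]))[:, 1])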

General info on custom AI on Azure

Microsoft AI on GitHub: Samples, reference architectures, and best practices

Custom AI on Azure GitHub repo. A collection of scripts and tutorials to help
developers effectively use Azure for their AI workloads

Azure Machine Learning SDK for Python

Azure Machine Learning service example notebooks (Python). A GitHub repo of
example notebooks demonstrating the Azure Machine Learning Python SDK

Azure Machine Learning SDK for R

Azure AI platform offerings


Following is a breakdown of Azure technologies, platforms, and services you can use to
develop AI solutions for your needs.

Azure Machine Learning


This is an enterprise-grade machine learning service to build and deploy models faster.
Azure Machine Learning offers web interfaces and SDKs so you can quickly train and
deploy your machine learning models and pipelines at scale. Use these capabilities with
open-source Python frameworks, such as PyTorch, TensorFlow, and scikit-learn.
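
A minimal sketch of submitting a training script with the Azure Machine Learning
Python SDK (v1) follows. The training script train.py, the requirements.txt file, and the
compute cluster name cpu-cluster are assumptions for this example.

from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()  # assumes config.json for an existing workspace

env = Environment.from_pip_requirements("training-env", "requirements.txt")
config = ScriptRunConfig(
    source_directory=".",
    script="train.py",             # hypothetical training script
    compute_target="cpu-cluster",  # hypothetical compute cluster
    environment=env,
)

run = Experiment(ws, "train-demo").submit(config)  # runs the script in the cloud
run.wait_for_completion(show_output=True)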

What are the machine learning products at Microsoft?

Azure Machine Learning product home page

Azure Machine Learning documentation overview

What is Azure Machine Learning? General orientation with links to many learning
resources, SDKs, documentation, and more

Machine learning reference architectures for Azure


Batch scoring of Python machine learning models on Azure

Batch scoring of deep learning models on Azure


Machine learning operationalization (MLOps) for Python models using Azure
Machine Learning

Batch scoring of R machine learning models on Azure

Batch scoring of Spark machine learning models on Azure Databricks

Enterprise-grade conversational bot

Build a real-time recommendation API on Azure

Azure automated machine learning


Azure provides extensive support for automated ML. Developers can build models using
a no-code UI or through a code-first notebooks experience.

Azure automated machine learning product home page

Azure automated ML infographic (PDF)

Tutorial: Create a classification model with automated ML in Azure Machine
Learning

Tutorial: Use automated machine learning to predict taxi fares

Configure automated ML experiments in Python

Use the CLI extension for Azure Machine Learning

Automate machine learning activities with the Azure Machine Learning CLI

Azure Cognitive Services


This is a comprehensive family of AI services and cognitive APIs to help you build
intelligent apps. These domain-specific, pretrained AI models can be customized with
your data.

Cognitive Services product home page

Azure Cognitive Services documentation

Azure Cognitive Search


This is an AI-powered cloud search service for mobile and web app development. The
service can search over private heterogenous content, with options for AI enrichment if
your content is unstructured or unsearchable in raw form.

Azure Cognitive Search product home page

Getting started with AI enrichment

Azure Cognitive Search documentation overview

Choosing a natural language processing technology in Azure

Quickstart: Create an Azure Cognitive Search cognitive skill set in the Azure portal

Azure Bot Service


This is a purpose-built bot development environment with out-of-the-box templates to
get started quickly.

Azure Bot Service product home page

Azure Bot Service documentation overview

Azure reference architecture: Enterprise-grade conversational bot

Microsoft Bot Framework

GitHub Bot Builder repo

Apache Spark on Azure


Apache Spark is a parallel processing framework that supports in-memory processing to
boost the performance of big data analytic applications. Spark provides primitives for in-
memory cluster computing. A Spark job can load and cache data into memory and
query it repeatedly, which is much faster than disk-based applications, such as Hadoop.
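
The following PySpark sketch shows the load-cache-query pattern the paragraph
describes; the storage path and column name are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Load once and cache in memory, then query repeatedly without rereading from disk.
df = spark.read.csv("abfss://data@myaccount.dfs.core.windows.net/events.csv", header=True)  # hypothetical path
df.cache()

print(df.count())                       # first action materializes the cache
df.groupBy("eventType").count().show()  # later queries hit the in-memory data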

Apache Spark in Azure HDInsight is the Microsoft implementation of Apache Spark in
the cloud. Spark clusters in HDInsight are compatible with Azure Storage and Azure
Data Lake Storage, so you can use HDInsight Spark clusters to process your data stored
in Azure.

The Microsoft Machine Learning library for Apache Spark is MMLSpark (Microsoft ML
for Apache Spark). It is an open-source library that adds many deep learning and data
science tools, networking capabilities, and production-grade performance to the Spark
ecosystem. Learn more about MMLSpark features and capabilities.

Azure HDInsight overview. Basic information about features, cluster architecture,
and use cases, with pointers to quickstarts and tutorials.

Tutorial: Build an Apache Spark machine learning application in Azure HDInsight

Apache Spark best practices on HDInsight

Configure HDInsight Apache Spark Cluster settings

Machine learning on HDInsight

GitHub repo for MMLSpark: Microsoft Machine Learning library for Apache Spark

Create an Apache Spark machine learning pipeline on HDInsight

Azure Databricks Runtime for Machine Learning


Azure Databricks is an Apache Spark–based analytics platform with one-click setup,
streamlined workflows, and an interactive workspace for collaboration between data
scientists, engineers, and business analysts.

Databricks Runtime for Machine Learning (Databricks Runtime ML) lets you start a
Databricks cluster with all of the libraries required for distributed training. It provides a
ready-to-go environment for machine learning and data science. Plus, it contains
multiple popular libraries, including TensorFlow, PyTorch, Keras, and XGBoost. It also
supports distributed training using Horovod.

Azure Databricks product home page

Azure Databricks documentation

Machine learning capabilities in Azure Databricks

How-to guide: Databricks Runtime for Machine Learning

Batch scoring of Spark machine learning models on Azure Databricks

Deep learning overview for Azure Databricks

Customer stories
Different industries are applying AI in innovative and inspiring ways. Following are a
number of customer case studies and success stories:

ASOS: Online retailer solves challenges with Azure Machine Learning service

KPMG helps financial institutions save millions in compliance costs with Azure
Cognitive Services

Volkswagen: Machine translation speaks Volkswagen – in 40 languages

Buncee: NYC school empowers readers of all ages and abilities with Azure AI

InterSystems: Data platform company boosts healthcare IT by generating critical
information at unprecedented speed

Zencity: Data-driven startup uses funding to help local governments support better
quality of life for residents

Bosch uses IoT innovation to drive traffic safety improvements by helping drivers
avoid serious accidents

Automation Anywhere: Robotic process automation platform developer enriches
its software with Azure Cognitive Services

Wix deploys smart, scalable search across 150 million websites with Azure
Cognitive Search

Asklepios Klinik Altona: Precision surgeries with Microsoft HoloLens 2 and 3D
visualization

AXA Global P&C: Global insurance firm models complex natural disasters with
cloud-based HPC

Browse more AI customer stories

Next steps
To learn about the artificial intelligence development products available from
Microsoft, refer to the Microsoft AI platform page.

For training in how to develop AI solutions, refer to Microsoft AI School.

Microsoft AI on GitHub: Samples, reference architectures, and best practices
organizes the Microsoft open source AI-based repositories, providing tutorials and
learning materials.

Suggest content tags with NLP using deep learning

Azure Container Registry Azure Cognitive Search Azure Kubernetes Service (AKS) Azure Machine Learning

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This article describes how you can use Microsoft AI to improve website content tagging
accuracy by combining deep learning and natural language processing (NLP) with data
on site-specific search terms.

Architecture

Download a Visio file of this architecture.

Dataflow
1. Data is stored in various formats, depending on its original source. Data can be
stored as files within Azure Data Lake Storage or in tabular form in Azure Synapse
or Azure SQL Database.

2. Azure Machine Learning (ML) can connect and read from such sources, to ingest
the data into the NLP pipeline for pre-processing, model training, and post-
processing.

3. NLP pre-processing includes several steps to consume data, with the purpose of
text generalization. Once the text is broken up into sentences, NLP techniques,
such as lemmatization or stemming, allow the language to be tokenized in a
general form.

4. Because pre-trained NLP models are already available, the transfer learning approach
recommends that you download language-specific embeddings and use an
industry-standard model for multi-class text classification, such as a variation of
BERT (a minimal sketch follows this list).

5. For NLP post-processing, we recommend storing the model in a model registry in Azure
ML, to track model metrics. Furthermore, text can be post-processed with specific
business rules that are deterministically defined, based on the business goals.
Microsoft recommends using ethical AI tools to detect biased language, which
helps ensure the fair training of a language model.

6. The model can be deployed as real-time endpoints through Azure Kubernetes Service,
which runs a managed Kubernetes cluster where the containers are deployed from
images that are stored in Azure Container Registry. The endpoints can be made
available to a front-end application.

7. Model results can be written to a storage option in file or tabular format, then
properly indexed by Azure Cognitive Search. The model would run as batch
inference and store the results in the respective datastore.
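
As a minimal illustration of step 4 (not the exact pipeline used in this solution), the
sketch below loads a pretrained BERT variant with the Hugging Face transformers
library and attaches an untrained multi-class classification head. The tag vocabulary is
hypothetical, and in practice the head would be fine-tuned on the site's labeled data.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tags = ["azure-ml", "cognitive-search", "kubernetes", "power-bi"]  # hypothetical site tags

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(tags)  # pretrained body, new classification head
)

inputs = tokenizer("How do I deploy a model to AKS?", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(tags[int(logits.argmax())])  # prediction from the not-yet-fine-tuned head, for illustration only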

Components
Data Lake Storage for Big Data Analytics
Azure Machine Learning
Azure Cognitive Search
Azure Container Registry
Azure Kubernetes Service (AKS)

Scenario details
Social sites, forums, and other text-heavy Q&A services rely heavily on content tagging,
which enables good indexing and user search. Often, however, content tagging is left to
users' discretion. Because users don't have lists of commonly searched terms or a deep
understanding of the site structure, they frequently mislabel content. Mislabeled content
is difficult or impossible to find when it's needed later.

Potential use cases


By using natural language processing (NLP) with deep learning for content tagging, you
enable a scalable solution to create tags across content. As users search for content by
keywords, this multi-class classification process enriches untagged content with labels
that will allow you to search on substantial portions of text, which improves the
information retrieval processes. New incoming content will be appropriately tagged by
running NLP inference.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Louis Li | Senior Customer Engineer

Next steps
See the product documentation:

Azure Data Lake Storage Gen2 Introduction


Azure Machine Learning
Azure Cognitive Search documentation
Learn more about Azure Container Registry
Azure Kubernetes Service

Try these Microsoft Learn modules:

Introduction to Natural Language Processing with PyTorch


Train and evaluate deep learning models
Implement knowledge mining with Azure Cognitive Search

Related resources
See the following related architectural articles:

Natural language processing technology


Build a delta lake to support ad hoc queries in online leisure and travel booking
Query a data lake or lakehouse by using Azure Synapse serverless
Machine learning operations (MLOps) framework to upscale machine learning
lifecycle with Azure Machine Learning
High-performance computing for manufacturing
Introduction to predictive maintenance in manufacturing
Predictive maintenance solution

Knowledge mining for customer support and feedback analysis

Azure Cognitive Search Azure AI Language Azure Translator

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This architecture shows how knowledge mining can help customer support teams
quickly find answers to customer questions or assess customer sentiment at scale.

Architecture
There are three steps in knowledge mining: ingest, enrich, and explore.

Download a Visio file of this architecture.

Dataflow
Ingest
The ingest step aggregates content from a range of sources, including structured and
unstructured data. For customer support and feedback analysis, you can ingest different
types of content. This content includes customer support tickets, chat logs, call
transcriptions, customer emails, customer payment history, product reviews, social
media feeds, online comments, feedback forms, and surveys.

Enrich

The enrich step uses AI capabilities to extract information, find patterns, and deepen
understanding. You can enrich content by using key phrase extraction, sentiment
analysis, language translation, bot services, and custom models that focus on specific
products or company policies.

Explore

In the explore step, you explore data via search, existing business applications, or analytics
solutions. For example, you can compile enriched documents in the knowledge store
and project them into tabular or object stores. The stores can be used to surface trends
in an analytics dashboard that identifies frequent issues or popular products. Or, you can
integrate the search index into customer service support applications.

Components
The following key technologies are used to implement tools for technical content review
and research:

Azure Cognitive Search is a cloud search service that supplies infrastructure,
APIs, and tools for searching. You can use Azure Cognitive Search to build search
experiences over private, heterogeneous content in web, mobile, and enterprise
applications.
The web API custom skill interface is used to integrate a custom skill into an Azure
Cognitive Search enrichment pipeline (a minimal sketch of the request and response
shape follows this list).
Azure Cognitive Service for Language is part of Azure Cognitive Services that
offers many natural language processing services. You can use these services to
understand and analyze text.
Text analytics is a collection of APIs and other features from Azure Cognitive
Service for Language that you can use to extract, classify, and understand text
within documents.
Cognitive Services Translator is part of the Cognitive Services family of REST
APIs. You can use Translator for real-time document and text translation.
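
A minimal sketch of the request and response shape of a web API custom skill, written
as an Azure Functions HTTP trigger in Python, follows; the word-count enrichment is a
placeholder for whatever custom logic the skill performs.

import json

import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
    # The enrichment pipeline sends a "values" array; each record is processed
    # independently and the result is returned keyed by the same recordId.
    body = req.get_json()
    results = []
    for record in body.get("values", []):
        text = record["data"].get("text", "")
        results.append({
            "recordId": record["recordId"],
            "data": {"wordCount": len(text.split())},  # placeholder enrichment
            "errors": [],
            "warnings": [],
        })
    return func.HttpResponse(json.dumps({"values": results}), mimetype="application/json")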

Scenario details
For many companies, customer support is costly and doesn't always operate efficiently.
Knowledge mining can help customer support teams quickly find the best answers to
customer questions or assess customer sentiment at scale.

Potential use cases


This solution is optimized for the retail industry.

Azure Cognitive Search is a key part of knowledge mining solutions. Azure Cognitive
Search creates a search index over aggregated and analyzed content.

With queries using the search index, companies can discover trends about what
customers are saying and use that information to improve products and services.

Next steps
To build an initial knowledge mining prototype with Azure Cognitive Search, use
the knowledge mining solution accelerator.

Build an Azure Cognitive Search custom skill.

Explore the learning path Knowledge mining with Azure Cognitive Search.

To learn more about the components in this solution, see these resources:
Azure Cognitive Search documentation
Text analytics REST API reference - Azure Cognitive Services
What is Azure Cognitive Services Translator?

Related resources
Azure Cognitive Search
Text analytics

Large-scale custom natural language processing

Azure Computer Vision Azure Data Lake Storage Azure Databricks Azure HDInsight
Azure Synapse Analytics

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

Implement a custom natural language processing (NLP) solution in Azure. Use Spark
NLP for tasks like topic and sentiment detection and analysis.

Apache®, Apache Spark , and the flame logo are either registered trademarks or
trademarks of the Apache Software Foundation in the United States and/or other
countries. No endorsement by The Apache Software Foundation is implied by the use of
these marks.

Architecture

Download a Visio file of this architecture.

Workflow
1. Azure Event Hubs, Azure Data Factory, or both services receive documents or
unstructured text data.
2. Event Hubs and Data Factory store the data in file format in Azure Data Lake
Storage. We recommend that you set up a directory structure that complies with
business requirements.
3. The Azure Computer Vision API uses its optical character recognition (OCR)
capability to consume the data. The API then writes the data to the bronze layer.
This consumption platform uses a lakehouse architecture.
4. In the bronze layer, various Spark NLP features preprocess the text. Examples
include splitting, correcting spelling, cleaning, and understanding grammar. We
recommend running document classification at the bronze layer and then writing
the results to the silver layer.
5. In the silver layer, advanced Spark NLP features perform document analysis tasks
like named entity recognition, summarization, and information retrieval. In some
architectures, the outcome is written to the gold layer.
6. In the gold layer, Spark NLP runs various linguistic visual analyses on the text data.
These analyses provide insight into language dependencies and help with the
visualization of NER labels.
7. Users query the gold layer text data as a data frame and view the results in Power
BI or web apps.

During the processing steps, Azure Databricks, Azure Synapse Analytics, and Azure
HDInsight are used with Spark NLP to provide NLP functionality.

Components
Data Lake Storage is a Hadoop-compatible file system that has an integrated
hierarchical namespace and the massive scale and economy of Azure Blob Storage.
Azure Synapse Analytics is an analytics service for data warehouses and big data
systems.
Azure Databricks is an analytics service for big data that's easy to use, facilitates
collaboration, and is based on Apache Spark. Azure Databricks is designed for data
science and data engineering.
Event Hubs ingests data streams that client applications generate. Event Hubs
stores the streaming data and preserves the sequence of received events.
Consumers can connect to hub endpoints to retrieve messages for processing.
Event Hubs integrates with Data Lake Storage, as this solution shows.
Azure HDInsight is a managed, full-spectrum, open-source analytics service in the
cloud for enterprises. You can use open-source frameworks with Azure HDInsight,
such as Hadoop, Apache Spark, Apache Hive, LLAP, Apache Kafka, Apache Storm,
and R.
Data Factory automatically moves data between storage accounts of differing
security levels to ensure separation of duties.
Computer Vision uses text recognition APIs to recognize text in images and
extract that information. The Read API uses the latest recognition models, and is
optimized for large, text-heavy documents and noisy images. The OCR API isn't
optimized for large documents but supports more languages than the Read API.
This solution uses OCR to produce data in the hOCR format.

Scenario details
Natural language processing (NLP) has many uses: sentiment analysis, topic detection,
language detection, key phrase extraction, and document categorization.

Apache Spark is a parallel processing framework that supports in-memory processing to
boost the performance of big-data analytic applications like NLP. Azure Synapse
Analytics, Azure HDInsight, and Azure Databricks offer access to Spark and take
advantage of its processing power.

For customized NLP workloads, the open-source library Spark NLP serves as an efficient
framework for processing a large amount of text. This article presents a solution for
large-scale custom NLP in Azure. The solution uses Spark NLP features to process and
analyze text. For more information about Spark NLP, see Spark NLP functionality and
pipelines, later in this article.

Potential use cases


Document classification: Spark NLP offers several options for text classification:
Text preprocessing in Spark NLP and machine learning algorithms that are
based on Spark ML
Text preprocessing and word embedding in Spark NLP and machine learning
algorithms such as GloVe, BERT, and ELMo
Text preprocessing and sentence embedding in Spark NLP and machine learning
algorithms and models such as the Universal Sentence Encoder
Text preprocessing and classification in Spark NLP that uses the ClassifierDL
annotator and is based on TensorFlow

Named entity recognition (NER): In Spark NLP, with a few lines of code, you can train
a NER model that uses BERT, and you can achieve state-of-the-art accuracy. NER is
a subtask of information extraction. NER locates named entities in unstructured
text and classifies them into predefined categories such as person names,
organizations, locations, medical codes, time expressions, quantities, monetary
values, and percentages. Spark NLP uses a state-of-the-art NER model with BERT.
The model is inspired by a former NER model, bidirectional LSTM-CNN. That
former model uses a novel neural network architecture that automatically detects
word-level and character-level features. For this purpose, the model uses a hybrid
bidirectional LSTM and CNN architecture, so it eliminates the need for most
feature engineering.

Sentiment and emotion detection: Spark NLP can automatically detect positive,
negative, and neutral aspects of language.

Part of speech (POS): This functionality assigns a grammatical label to each token
in input text.

Sentence detection (SD): SD is based on a general-purpose neural network model
for sentence boundary detection that identifies sentences within text. Many NLP
tasks take a sentence as an input unit. Examples of these tasks include POS
tagging, dependency parsing, named entity recognition, and machine translation.

Spark NLP functionality and pipelines


Spark NLP provides Python, Java, and Scala libraries that offer the full functionality of
traditional NLP libraries such as spaCy, NLTK, Stanford CoreNLP, and Open NLP. Spark
NLP also offers functionality such as spell checking, sentiment analysis, and document
classification. Spark NLP improves on previous efforts by providing state-of-the-art
accuracy, speed, and scalability.

Spark NLP is by far the fastest open-source NLP library. Recent public benchmarks show
Spark NLP as 38 and 80 times faster than spaCy , with comparable accuracy for
training custom models. Spark NLP is the only open-source library that can use a
distributed Spark cluster. Spark NLP is a native extension of Spark ML that operates
directly on data frames. As a result, speedups on a cluster result in another order of
magnitude of performance gain. Because every Spark NLP pipeline is a Spark ML
pipeline, Spark NLP is well-suited for building unified NLP and machine learning
pipelines such as document classification, risk prediction, and recommender pipelines.

Besides excellent performance, Spark NLP also delivers state-of-the-art accuracy for a
growing number of NLP tasks. The Spark NLP team regularly reads the latest relevant
academic papers and produces the most accurate models.

For the execution order of an NLP pipeline, Spark NLP follows the same development
concept as traditional Spark ML machine learning models. But Spark NLP applies NLP
techniques. The following diagram shows the core components of a Spark NLP pipeline.
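
A minimal Spark NLP pipeline sketch in Python is shown below; the stages and the
sample sentence are arbitrary, and a real pipeline would add the annotators the
workload needs (classification, NER, and so on).

import sparknlp
from pyspark.ml import Pipeline
from sparknlp.annotator import LemmatizerModel, Normalizer, Tokenizer
from sparknlp.base import DocumentAssembler

spark = sparknlp.start()

# Each stage consumes the previous stage's annotations, mirroring the pipeline diagram.
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normalized")
lemmatizer = LemmatizerModel.pretrained().setInputCols(["normalized"]).setOutputCol("lemma")

pipeline = Pipeline(stages=[document, tokenizer, normalizer, lemmatizer])

df = spark.createDataFrame([("The drivers were driving faster than usual.",)], ["text"])
pipeline.fit(df).transform(df).select("lemma.result").show(truncate=False)
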
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Moritz Steller | Senior Cloud Solution Architect

Next steps
Spark NLP documentation:
Spark NLP
Spark NLP general documentation
Spark NLP GitHub
Spark NLP demo

Azure components:
Data in Azure Machine Learning
What is Azure HDInsight?
Data Lake Storage
Azure Synapse Analytics
Event Hubs
Azure HDInsight
Data Factory
Computer Vision APIs

Related resources
Natural language processing technology
AI enrichment with image and natural language processing in Azure Cognitive
Search
Analyze news feeds with near real-time analytics using image and natural language
processing
Suggest content tags with NLP using deep learning

Image classification with convolutional neural networks (CNNs)

Azure Blob Storage Azure Container Registry Azure Data Science Virtual Machines
Azure Kubernetes Service (AKS) Azure Machine Learning

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

Use convolutional neural networks (CNNs) to classify large volumes of images efficiently
to identify elements in images.

Architecture

Download a Visio file of this architecture.

Dataflow
1. Image uploads to Azure Blob Storage are ingested by Azure Machine Learning.
2. Because the solution follows a supervised learning approach and needs data
labeling to train the model, the ingested images are labeled in Machine Learning.
3. The CNN model is trained and validated in the Machine Learning notebook.
Several pre-trained image classification models are available. You can use them
through a transfer learning approach. For information about some variants of pre-
trained CNNs, see Advancements in image classification using convolutional neural
networks. You can download these image classification models and customize
them with your labeled data.
4. After training, the model is stored in a model registry in Machine Learning.
5. The model is deployed through batch managed endpoints.
6. The model results are written to Azure Cosmos DB and consumed through the
front-end application.

Components
Blob Storage is a service that's part of Azure Storage . Blob Storage offers
optimized cloud object storage for large amounts of unstructured data.
Machine Learning is a cloud-based environment that you can use to train,
deploy, automate, manage, and track machine learning models. You can use the
models to forecast future behavior, outcomes, and trends.
Azure Cosmos DB is a globally distributed, multi-model database. With Azure
Cosmos DB, your solutions can elastically scale throughput and storage across any
number of geographic regions.
Azure Container Registry builds, stores, and manages container images and can
store containerized machine learning models.

Scenario details
With the rise of technologies such as the Internet of Things (IoT) and AI, the world is
generating large amounts of data. Extracting relevant information from the data has
become a major challenge. Image classification is a relevant solution to identifying what
an image represents. Image classification can help you categorize high volumes of
images. Convolutional neural networks (CNNs) render good performance on image
datasets. CNNs have played a major role in the development of state-of-the-art image
classification solutions.

There are three main types of layers in CNNs:

Convolutional layers
Pooling layers
Fully connected layers

The convolutional layer is the first layer of a convolutional network. This layer can follow
another convolutional layer or pooling layers. In general, the fully connected layer is the
final layer in the network.

As the number of layers increases, the complexity of the model increases, and the model
can identify greater portions of the image. The beginning layers focus on simple
features, such as edges. As the image data advances through the layers of the CNN, the
network starts recognizing more sophisticated elements or shapes in the object. Finally,
it identifies the expected object.
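
The following Keras sketch shows the three layer types in a small, arbitrary network;
the input size and class count are placeholders for this example.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),  # convolutional layer
    layers.MaxPooling2D((2, 2)),                                            # pooling layer
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),                                 # fully connected output layer
])
model.summary()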

Potential use cases


This solution can help automate failure detection, which is preferable to relying
solely on human operators. For instance, this solution can boost productivity by
identifying faulty electronic components. This capability is important for lean
manufacturing, cost control, and waste reduction in manufacturing. In circuit-
board manufacturing, faulty boards can cost manufacturers money and
productivity. Assembly lines rely on human operators to quickly review and
validate boards that are flagged as potentially faulty by assembly-line test
machines.
Image classification is ideal for the healthcare industry. Image classification helps
detect bone cracks, various types of cancer, and anomalies in tissues. You can also
use image classification to flag irregularities that can indicate the presence of
disease. An image classification model can improve the accuracy of MRIs.
In the agriculture domain, image classification solutions help identify plant
diseases and plants that require water. As a result, image classification helps to
reduce the need for human intervention.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributor.

Principal author:

Ashish Chauhan | Senior Solution Architect


Next steps
To learn more about Blob Storage, see Introduction to Azure Blob Storage.
To learn more about Container Registry, see Introduction to Container registries in
Azure.
To learn more about model management (MLOps), see MLOps: Model
management, deployment, lineage, and monitoring with Azure Machine Learning.
To browse an implementation of this solution idea on GitHub, see Synapse
Machine Learning .
To explore a Microsoft Learn module that includes a section on CNNs, see Train
and evaluate deep learning models.

Related resources
Visual search in retail with Azure Cosmos DB

Retail assistant with visual capabilities

Azure App Service Bing Custom Search Bing Visual Search Azure AI Bot Service Azure AI services

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This solution idea uses Azure services with a bot assistant to improve interactions with
customers and provide suggestions based on visual information.

Architecture

Architecture diagram: a user's web browser or mobile device connects to an application
hosted on Azure App Service, which works with Azure Bot Service, Language
Understanding, the Bing Visual Search API, and the Bing Custom Search API.

Download a Visio file of this architecture.

Dataflow
1. The user uses an application, which is hosted on Azure App Service, either via a
web browser or a mobile device.
2. App Service communicates with Azure Bot Service to facilitate the interaction
between the user and the application.
3. Bot Service uses Azure Cognitive Services Language Understanding to identify user
intents and meaning.
4. Language Understanding (LUIS) returns the identified user intent to the Azure bot.
5. The bot passes a visual context input, such as an image, to the Bing Visual Search
API.
6. The API returns output to Bot Service.
7. Optionally, the bot retrieves more information for user queries within the user's
domain by using the Bing Custom Search API.
8. The Custom Search API returns output to Bot Service.

Components
App Service provides a framework for building, deploying, and scaling web apps.
Bot Service provides an integrated development environment for bot building.
Cognitive Services consists of cloud-based services that provide AI functionality.
Azure Cognitive Service for Language is part of Cognitive Services that offers
many natural language processing services.
Conversational language understanding is a feature of Cognitive Service for
Language. This cloud-based API service offers machine-learning intelligence
capabilities for building conversational apps. You can use language understanding
(LUIS) to predict the meaning of a conversation and pull out relevant, detailed
information.
The Bing Visual Search API returns data that's related to a given image, such as
similar images, shopping sources for purchasing the item in the image, and
webpages that include the image.
The Bing Custom Search API provides a way to create tailored ad-free search
experiences for topics.

Scenario details
This solution features a bot assistant with search integration. The bot can help
customers interact with a business application. It can also provide suggestions based on
visual information.

Potential use cases


This solution can be used broadly, but is ideal for the retail industry and the travel and
hospitality industries.

Next steps
What is Azure Cognitive Services?
What is Language Understanding (LUIS)?
Bing Search API documentation
What is the Bing Visual Search API?
What is the Bing Custom Search API?
App Service overview
Azure Bot Service documentation
Introduction to Bot Framework Composer

Related resources
Visual assistant
Artificial intelligence (AI) - Architectural overview
Choose a Microsoft Azure Cognitive Services technology
Visual assistant
Azure App Service Azure AI Bot Service Azure AI services

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This solution presents a visual assistant that provides rich information that's based on
the content of an image.

Architecture

Architecture diagram: a customer's mobile device or web browser connects to an
application on Azure App Service, which works with Azure Bot Service, Language
Understanding, the Bing Visual Search API, the Bing Entity Search API, the Bing Web
Search API, and the Bing Custom Search API.
Download a Visio file of this architecture.


Dataflow
1. Users interact with a bot through a mobile app or a web app.
2. The bot uses Language Understanding Intelligence Service (LUIS), which is built
into the application, to identify the user intent and conversational context.
3. The bot passes visual context, such as an image, to the Bing Visual Search API.
4. The bot retrieves information from the Bing Entity Search API about people, places,
artwork, monuments, and objects that are related to the image.
5. The bot retrieves information from barcodes.
6. Optionally, the bot gets more information about barcodes or queries that's limited
to the user's domain by using the Bing Custom Search API.
7. The visual assistant presents the user with the information about related products,
destinations, celebrities, places, monuments, and artwork.

Components
Azure App Service is a fully managed HTTP-based service for hosting web apps,
REST APIs, and mobile backends.
Azure Bot Service offers an environment for developing intelligent, enterprise-
grade bots that enrich customer experiences. The integrated environment also
provides a way to maintain control of your data.
The Bing Custom Search API provides a way to create customized search
experiences with Bing's powerful ranking and global-scale search index.
The Bing Entity Search API offers search capabilities that identify relevant
entities, such as well-known people, places, movies, TV shows, video games, books,
and businesses.
The Bing Visual Search API returns data that's related to a given image, such as
similar images, shopping sources for purchasing the item in the image, and
webpages that include the image.
The Bing Web Search API provides search results after you issue a single API call.
The results compile relevant information from billions of webpages, images,
videos, and news.
Azure Cognitive Service for Language is part of Azure Cognitive Services that
offers many natural language processing services.
Conversational language understanding is a feature of Cognitive Service for
Language. This cloud-based API service offers machine-learning intelligence
capabilities for building conversational apps. You can use LUIS to predict the
meaning of a conversation and pull out relevant, detailed information.

Scenario details
This solution presents a visual assistant that provides rich information that's based on
the content of an image. The assistant's capabilities include reading business cards,
deciphering barcodes, and recognizing well-known people, places, objects, artwork, and
monuments.

Potential use cases


Organizations can use this solution to provide:

Appointment scheduling.
Order and delivery tracking in manufacturing, automotive, and transportation
applications.
Barcode purchases in retail.
Payment processing in finance and retail.
Subscription renewals in retail.
The identification of well-known people, places, objects, art, and monuments, in
the education, media, and entertainment industries.

Next steps
To design an app that detects context that matters to you, see Quickstart: Create
an object detection project with the Custom Vision client library.

To explore the search capabilities that Bing provides, see Bing family of search
APIs.

To build LUIS into your bot, see Add natural language understanding to your bot.

To explore a Learn module about how LUIS works, see Create a language model
with Conversational Language Understanding.

To learn how to build with Bot Service, see Build a bot with the Language Service
and Azure Bot Service.

To create a bot that incorporates QnA Maker and Bot Service, see Create
conversational AI solutions.

To solidify your understanding of LUIS, Bot Service, and the Bing Visual Search API,
see Exam AI-900: Microsoft Azure AI Fundamentals.

To certify your knowledge about Cognitive Services, see Microsoft Certified: Azure
AI Engineer Associate.
To learn more about the components in this solution, see these resources:
App Service overview
Azure Bot Service documentation
What is Bing Custom Search?
What is Bing Entity Search API?
What is the Bing Visual Search API?
What is the Bing Web Search API?
What is Language Understanding (LUIS)?

Related resources
Artificial intelligence (AI) - Architectural overview
Image classification on Azure
Retail assistant with visual capabilities

Vision classifier model with Azure Custom Vision Cognitive Service

Azure GitHub

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This architecture uses Custom Vision to classify images taken by a simulated drone. It
provides a way to combine AI and the Internet of Things (IoT). Azure Custom Vision can
also be used for object detection.

Architecture

Download a Visio file of this architecture.

Workflow
1. Use AirSim's 3D-rendered environment to capture images with the drone. Use
the images as the training dataset.
2. Import and tag the dataset in a Custom Vision project. The cognitive service trains
and tests the model.
3. Export the model into TensorFlow format so you can use it locally.
4. The model can also be deployed to a container or to mobile devices.

Components

Microsoft AirSim Drone simulator


Microsoft AirSim Drone simulator is built on the Unreal Engine . The simulator is
open-source, cross-platform, and developed to help AI research. In this architecture, it
creates the dataset of images used to train the model.

Azure Custom Vision

Azure Custom Vision is part of Azure Cognitive Services. In this architecture, it
creates an image classifier model.

TensorFlow
TensorFlow is an open-source platform for machine learning (ML). It's a tool that helps
you develop and train ML models. When you export your model to TensorFlow format,
you'll have a protocol buffer file with the Custom Vision model that you can use locally
in your script.
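
A minimal sketch of loading the exported protocol buffer locally follows; it assumes the
export produced a model.pb frozen graph and a labels.txt file, and the input and output
tensor names needed for inference come from the exported model itself.

import tensorflow as tf

# Load the exported Custom Vision model (frozen graph in protocol buffer format).
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with open("labels.txt") as f:
    labels = [line.strip() for line in f]  # one class name per line

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name="")

print(f"Loaded graph with {len(labels)} classes: {labels}")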

Scenario details
Azure Cognitive Services offers many possibilities for Artificial Intelligence (AI) solutions.
One of them is Azure Custom Vision, which allows you to build, deploy, and improve
your image classifiers. This architecture uses Custom Vision to classify images taken by a
simulated drone. It provides a way to combine AI and the Internet of Things (IoT). Azure
Custom Vision can also be used for object detection.

Potential use case


This solution is ideal for the rescue, simulation, robotics, aircraft, aerospace, and aviation
industries.

Microsoft Search and Rescue Lab suggests a hypothetical use case for Custom Vision.
In the lab, you fly a Microsoft AirSim simulated drone around in a 3D-rendered
environment. You use the simulated drone to capture synthetic images of the animals in
that environment. After creating a dataset of images, you use the dataset to train a
Custom Vision classifier model. To train the model, you tag the images with the names
of the animals. When you fly the drone again, you take new images of the animals. This
solution identifies the name of the animal in each new image.

In a practical application of the lab, an actual drone replaces the Microsoft AirSim
simulated drone. If a pet is lost, the owner provides images of the pet to the Custom
Vision model trainer. Just like in the simulation, the images are used to train the model
to recognize the pet. Then, the drone pilot searches an area where the lost pet might be.
As it finds animals along the way, the drone's camera can capture images and determine
if the animal is the lost pet.

Deploy this scenario


To deploy this reference architecture, follow the steps described in the GitHub repo of
the Search and Rescue Lab .

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Jose Contreras | Principal Software Engineer

Next steps
Learn more about Microsoft AirSim
Learn more about Azure Custom Vision Cognitive Service
Learn more about Azure Cognitive Services

Related resources
Read other Azure Architecture Center articles:

Image classification on Azure


Geospatial analysis with Azure Synapse Analytics
AI enrichment with image and natural language processing in Azure Cognitive
Search

Keyword search and speech-to-text

Azure Content Delivery Network Azure Cognitive Search Azure Media Player Azure Video Indexer
Azure App Service

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This solution idea identifies speech in static video files to manage speech as standard
content.

Architecture

Architecture diagram: source audio and video files are stored in Azure Blob Storage and
processed by the Azure encoder (Standard or Premium). A streaming endpoint with
multi-protocol dynamic packaging and multi-DRM delivers content through Azure CDN
to Azure Media Player. The Azure Media Indexer/OCR media processor generates TTML
and WebVTT captions and keywords, which Azure Search and Web Apps consume.

Download a Visio file of this architecture.


Dataflow
Azure Blob Storage stores large amounts of unstructured data that can be
accessed from anywhere in the world via HTTP or HTTPS. You can use Blob Storage
to expose data publicly to the world, or to store application data privately.
Azure Encoding converts media files from one encoding to another.
Azure streaming endpoint represents a streaming service that can deliver content
directly to a client player application, or to a content delivery network (CDN) for
further distribution.
Content Delivery Network provides secure, reliable content delivery with broad
global reach and a rich feature set.
Azure Media Player uses industry standards, such as HTML5 (MSE/EME) to provide
an enriched adaptive streaming experience. Regardless of the playback technology
used, you have a unified JavaScript interface to access APIs.
Azure Cognitive Search provides a ready-to-use service that gets populated with
data and then used to add search functionality to a web or mobile application.
Web Apps hosts the website or web application.
Azure Media Indexer makes the content of your media files searchable and
generates a full-text transcript for closed-captioning and keywords. Media files are
processed individually or in batches.

Components
Blob Storage is a service that's part of Azure Storage . Blob Storage offers
optimized cloud object storage for large amounts of unstructured data.
Azure Media Services is a cloud-based platform that you can use to stream
video, enhance accessibility and distribution, and analyze video content.
Live and on-demand streaming is a feature of Azure Media Services that delivers
content to various devices at scale.
Azure Encoding provides a way to convert files that contain digital video or
audio from one standard format to another.
Azure Media Player plays videos that are in various formats.
Azure Content Delivery Network offers a global solution for rapidly delivering
content. This service provides your users with fast, reliable, and secure access to
your apps' static and dynamic web content.
Azure Cognitive Search is a cloud search service that supplies infrastructure,
APIs, and tools for searching. You can use Azure Cognitive Search to build search
experiences over private, heterogeneous content in web, mobile, and enterprise
applications.
App Service provides a framework for building, deploying, and scaling web apps.
The Web Apps feature is a service for hosting web applications, REST APIs, and
mobile back ends.
Azure Media Indexer provides a way to make content of your media files
searchable. It can also generate a full-text transcript for closed captioning and
keywords.

Scenario details
A speech-to-text solution provides a way to identify speech in static video files so you
can manage it as standard content. For instance, employees can use this technology to
search within training videos for spoken words or phrases. Then they can navigate to the
specific moment in the video that contains the word or phrase.

When you use this solution, you can upload static videos to an Azure website. The Azure
Media Indexer uses the Speech API to index the speech within the videos and stores it in
an Azure database. You can search for words or phrases by using the Web Apps feature
of Azure App Service. Then you can retrieve a list of results. When you select a result,
you can see the place in the video that mentions the word or phrase.
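
As a minimal sketch of the search step, assuming an Azure Cognitive Search index (here named video-transcripts) that the indexing pipeline has already populated, and placeholder endpoint, key, and field names:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholder search service endpoint, query key, and index name.
search_client = SearchClient(
    endpoint="https://<search-service>.search.windows.net",
    index_name="video-transcripts",
    credential=AzureKeyCredential("<query-key>"),
)

# Full-text search for a spoken phrase; videoId, startTime, and text are
# hypothetical field names for values produced by the indexing pipeline.
results = search_client.search(search_text="quarterly forecast", top=10)
for hit in results:
    print(hit["videoId"], hit["startTime"], hit["text"])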

This solution is built on the Azure managed services Content Delivery Network and
Azure Cognitive Search .

Potential use cases


This solution applies to scenarios that can benefit from the ability to search recorded
speech. Examples include:

Training and educational videos.


Crime investigations.
Customer service analysis.

Next steps
How to use Azure Blob Storage
How to encode an asset using Media Encoder
How to manage streaming endpoints
Using Azure Content Delivery Network
Develop video player applications
Create an Azure Cognitive Search service
Run Web Apps in the cloud
Indexing media files

Related resources
Gridwich cloud media system
Live stream digital media
Video-on-demand digital media
Customer churn prediction using real-time analytics
Azure Machine Learning

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

Customer Churn Prediction uses Azure AI platform to predict churn probability, and it
helps find patterns in existing data that are associated with the predicted churn rate.

Architecture

Download a Visio file of this architecture.

Dataflow
1. Use Azure Event Hubs to stream all live data into Azure.
2. Process real-time data using Azure Stream Analytics . Stream Analytics can
output processed data into Azure Synapse . This allows customers to combine
existing and historical data to create dashboards and reports in Power BI.

3. Ingest historical data at scale into Azure Blob Storage using Azure Synapse or
another ETL tool.

4. Use Azure Synapse to combine streaming data with historical data for reporting
or experimentation in Azure Machine Learning .

5. Use Azure Machine Learning to build models for predicting churn probability
and identify data patterns to deliver intelligent insights.

6. Use Power BI to build operational reports and dashboards on top of Azure Synapse. Azure Machine Learning models can be used to further enhance reporting and to assist businesses in decision making processes.

Components
Azure Event Hubs is an event ingestion service that can process millions of
events per second. Data sent to an event hub can be transformed and stored by
using any real-time analytics provider.
Azure Stream Analytics is a real-time analytics engine designed to analyze and
process high volumes of fast streaming data. Relationships and patterns identified
in the data can be used to trigger actions and initiate workflows such as creating
alerts, feeding information to a reporting tool, or storing transformed data for later
use.
Azure Blob Storage is a cloud service for storing large amounts of unstructured
data such as text, binary data, audio, and documents more easily and cost-
effectively. Azure Blob Storage allows data scientists quick access to data for
experimentation and AI model building.
Azure Synapse Analytics is a fast and reliable data warehouse with limitless
analytics that brings together data integration, enterprise data warehousing, and
big data analytics. It gives you the freedom to query data on your terms, using
either serverless or dedicated resources, and to serve data for immediate BI and
machine learning needs.
Azure Machine Learning can be used for any supervised and unsupervised
machine learning, whether you prefer to write Python or R code. You can build,
train, and track machine learning models in an Azure Machine Learning workspace.
Power BI is a suite of tools that delivers powerful insights to organizations.
Power BI connects to various data sources and simplifies data preparation and
model creation from disparate sources. It enhances team collaboration across the
organization to produce analytical reports and dashboards that support business
decisions, and it publishes them to the web and to mobile devices for users to consume.

Scenario details
Retaining existing customers is about five times less expensive than acquiring new
ones. For this reason, marketing executives often try to estimate the likelihood of
customer churn and to identify the actions needed to minimize the churn rate.

Potential use cases


This solution uses Azure Machine Learning to predict churn probability and helps find
patterns in existing data associated with the predicted churn rate. By using both
historical and near real-time data, users are able to create predictive models to analyze
characteristics and identify predictors of the existing audience. This information provides
businesses with actionable intelligence to improve customer retention and profit
margins.

This solution is optimized for the retail industry.

Deploy this scenario


For more details on how to build and deploy this solution, visit the solution guide in
GitHub .

The objective of this guide is to demonstrate predictive data pipelines for retailers to
predict customer churn. Retailers can use these predictions to prevent customer churn
by using their domain knowledge and proper marketing strategies to address at-risk
customers. The guide also shows how customer churn models can be retrained to use
more data as it becomes available.

What's under the hood


The end-to-end solution is implemented in the cloud, using Microsoft Azure. The
solution is composed of several Azure components, including data ingest, data storage,
data movement, advanced analytics, and visualization. The advanced analytics are
implemented in Azure Machine Learning, where you can use Python or R language to
build data science models. Or you can reuse existing in-house or third-party libraries.
With data ingest, the solution can make predictions based on data transferred to Azure
from an on-premises environment.
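
As a minimal sketch of such a data science model, assuming a customer table exported from Azure Synapse with usage features and a binary churned label (the file and column names are placeholders):

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical export of historical customer data.
df = pd.read_csv("customer_history.csv")
features = df.drop(columns=["customer_id", "churned"])
labels = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)

# Train a simple classifier and report the churn probability per customer.
model = GradientBoostingClassifier().fit(X_train, y_train)
churn_probability = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, churn_probability))

The solution guide's actual pipeline differs in data preparation, feature engineering, and retraining; this sketch only shows the shape of the modeling step.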

Solution dashboard
The snapshot below shows an example Power BI dashboard that gives insights into the
predicted churn rates across a customer base.

Next steps
About Azure Event Hubs
Welcome to Azure Stream Analytics
What is Azure Synapse Analytics?
Introduction to Azure Blob Storage
What is Azure Machine Learning?
What is Power BI?

Related resources
Architecture guides:

Artificial intelligence (AI)


Compare the machine learning products and technologies from Microsoft
Machine learning operations (MLOps) framework

Reference architectures:
Batch scoring for deep learning models
Batch scoring of Python models on Azure
Build a speech-to-text transcription pipeline
Personalized offers
Azure Event Hubs Azure Functions Azure Machine Learning Azure Storage Azure Stream Analytics

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This solution builds intelligent marketing systems that provide customer-tailored content by using machine learning models that analyze data from multiple sources. Key technologies used include Intelligent Recommendations and Azure Personalizer.

Architecture
(Architecture diagram: user activity, product, and offer data is ingested through a function app into Azure Event Hubs; Azure Stream Analytics sends aggregated data to Azure Cosmos DB for NoSQL and raw stream data to Azure Data Lake Storage; Intelligent Recommendations and Azure Personalizer, supported by product and offer views in Azure Cache for Redis, serve personalized results through the web app to the user, and Power BI visualizes the aggregated data.)

Download a Visio file of this architecture.

Dataflow
1. An Azure Function app captures the raw user activity (such as product and offer
clicks) and offers that are made to users on the website. The activity is sent to
Azure Event Hubs. In areas where user activity is not available, the simulated user
activity is stored in Azure Cache for Redis.
2. Azure Stream Analytics analyzes the data to provide near real-time analytics on the
input stream from the Azure Event Hubs instance.
3. The aggregated data is sent to Azure Cosmos DB for NoSQL.
4. Power BI is used to look for insights on the aggregated data.
5. The raw data is sent to Azure Data Lake Storage.
6. Intelligent Recommendations uses the raw data from Azure Data Lake Storage and
provides recommendations to Azure Personalizer.
7. The Personalizer service serves the top contextual and personalized products and
offers.
8. Simulated user activity data is provided to the Personalizer service to provide
personalized products and offers.
9. The results are provided on the web app that the user accesses.
10. User feedback is captured based on the reaction of the user to the displayed offers
and products. The reward score is provided to the Personalizer service to make it
perform better over time.
11. Retraining for Intelligent Recommendations can result in better recommendations.
This process can also be done by using refreshed data from Azure Data Lake
Storage.

Components
Event Hubs is a fully managed streaming platform. In this solution, Event Hubs
collects real-time consumption data.
Stream Analytics offers real-time serverless stream processing. This service
provides a way to run queries in the cloud and on edge devices. In this solution,
Stream Analytics aggregates the streaming data and makes it available for
visualization and updates.
Azure Cosmos DB is a globally distributed, multi-model database. With Azure
Cosmos DB, your solutions can elastically scale throughput and storage across any
number of geographic regions. The Azure Cosmos DB for NoSQL stores data in
document format and is one of several database APIs that Azure Cosmos DB
offers. In the GitHub implementation of this solution, DocumentDB was used to
store the customer, product, and offer information, but you can also use Azure
Cosmos DB for NoSQL. For more information, see Dear DocumentDB customers,
welcome to Azure Cosmos DB! .
Storage is a cloud storage solution that includes object, file, disk, queue, and
table storage. Services include hybrid storage solutions and tools for transferring,
sharing, and backing up data. This solution uses Storage to manage the queues
that simulate user interaction.
Functions is a serverless compute platform that you can use to build
applications. With Functions, you can use triggers and bindings to integrate
services. This solution uses Functions to coordinate the user simulation. Functions
is also the core component that generates personalized offers.
Machine Learning is a cloud-based environment that you can use to train,
deploy, automate, manage, and track machine learning models. Here, Machine
Learning uses each user's preferences and product history to provide the user-to-
product affinity scoring.
Azure Cache for Redis provides an in-memory data store that's based on Redis
software. Azure Cache for Redis provides open-source Redis capabilities as a fully
managed offering. In this solution, Azure Cache for Redis provides pre-computed
product affinities for customers with no available user history.
Power BI is a business analytics service that provides interactive visualizations
and business intelligence capabilities. Its easy-to-use interface makes it possible
for you to create your own reports and dashboards. This solution uses Power BI to
display real-time activity in the system. For instance, Power BI uses the data from
Azure Cosmos DB for NoSQL to display the customer response to various offers.
Data Lake Storage is a scalable storage repository that holds a large amount of
data in the data's native, raw format.

Solution details
In today's highly competitive and connected environment, modern businesses can no
longer survive on generic, static online content. Furthermore, marketing strategies that
use traditional tools can be expensive and hard to implement. As a result, they don't
produce the desired return on investment. These systems often fail to take full
advantage of collected data when they create a more personalized experience for users.

Presenting offers that are customized for each user has become essential to building
customer loyalty and remaining profitable. On a retail website, customers desire
intelligent systems that provide offers and content based on their unique interests and
preferences. Today's digital marketing teams can build this intelligence by using the data
that's generated from all types of user interactions.

Marketers now have the opportunity to deliver highly relevant and personalized offers
to each user by analyzing massive amounts of data. But building a reliable and scalable
big data infrastructure isn't trivial. And developing sophisticated machine learning
models that are personalized for each user is also a complex undertaking.

Intelligent Recommendations offers capabilities to drive desired outcomes, such as item recommendations that are based on user interactions and metadata. It can be used to promote and personalize any content type, such as sellable products, media, documents, offers, and more.
Azure Personalizer is a service that's part of Azure Cognitive Services. It can be used to
determine what product to suggest to shoppers or to figure out the optimal position for
an advertisement. Personalizer acts as the additional last-step ranker. After the
recommendations are shown to the user, the user's reaction is monitored and reported
as a reward score back to the Personalizer service. This process ensures that the service
is learning continuously, and it enhances Personalizer's ability to select the best items
based on the contextual information received.
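
A minimal sketch of that rank-and-reward loop against the Personalizer REST API follows. The endpoint, key, actions, and context features are placeholders, and the v1.0 request paths reflect the API version as documented at the time of writing; check the current Personalizer reference before relying on them.

import requests

endpoint = "https://<personalizer-resource>.cognitiveservices.azure.com"
headers = {"Ocp-Apim-Subscription-Key": "<personalizer-key>"}

# Ask Personalizer to rank candidate offers for the current context.
rank_request = {
    "contextFeatures": [{"device": "mobile", "timeOfDay": "evening"}],
    "actions": [
        {"id": "offer-10-percent-off", "features": [{"category": "discount"}]},
        {"id": "offer-free-shipping", "features": [{"category": "shipping"}]},
    ],
}
rank = requests.post(f"{endpoint}/personalizer/v1.0/rank",
                     headers=headers, json=rank_request).json()
print("Show offer:", rank["rewardActionId"])

# After observing the user's reaction, report a reward score (0.0 to 1.0) for
# the same event so that Personalizer continues to learn.
requests.post(f"{endpoint}/personalizer/v1.0/events/{rank['eventId']}/reward",
              headers=headers, json={"value": 1.0})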

Microsoft Azure provides advanced analytics tools in the areas of data ingestion, data
storage, data processing, and advanced analytics components—all the essential
elements for building a personalized offer solution.

System integrator
You can save time when you implement this solution by hiring a trained system
integrator (SI). The SI can help you develop a proof of concept and can help deploy and
integrate the solution.

Potential use cases


This solution applies to the marketing of goods and services based on customer data (products viewed and/or purchased). It could be applicable in the following areas:

E-commerce - This is an area where personalization is widely used with customer behavior and product recommendations.

Retail - Based on prior purchase data, recommendations and offers can be provided on products.

Telecom - Based on user interaction in this area, recommendations can be provided. Compared to other industries, the product and offer ranges might be limited.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mahi Sundararajan | Senior Customer Engineer

To see non-public LinkedIn profiles, sign in to LinkedIn.


Next steps
Detailed information about the classifiers that this model uses
MLOps: Model management, deployment, lineage, and monitoring with Azure
Machine Learning
Build a real-time recommendation API on Azure
Microsoft Certified: Data Scientist Associate certification
Create a classification model with Azure Machine Learning designer, with no
coding required
Use automated machine learning in Azure Machine Learning and learn how to
create a drag-and-drop machine learning model
Azure Event Hubs—A big data streaming platform and event ingestion service
Welcome to Azure Stream Analytics
Welcome to Azure Cosmos DB
Introduction to Azure Storage
Introduction to Azure Functions
What is Azure Machine Learning?
About Azure Cache for Redis
Create reports and dashboards in Power BI - documentation
Introduction to Azure Data Lake Storage Gen2

Related resources
Artificial intelligence (AI) - Architectural overview
Azure Machine Learning documentation
Optimize marketing with machine learning
Azure AI services Azure Synapse Analytics Azure Machine Learning Azure Data Lake Power BI

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

Azure services can extract insights from social media for you to use in big data
marketing campaigns.

Architecture

(Architecture diagram: external data such as text and posts is ingested into Azure Synapse Analytics, enriched by Azure Cognitive Services and Azure Machine Learning, stored in Azure Data Lake, and visualized through Azure App Service web apps and Microsoft Power BI.)

Download a Visio file of this architecture.

Dataflow
1. Azure Synapse Analytics enriches data in dedicated SQL pools with the model
that's registered in Azure Machine Learning via a stored procedure.
2. Azure Cognitive Services enriches the data by running sentiment analysis, predicting overall meaning, extracting relevant information, and applying other AI features (a minimal sentiment analysis call is sketched after this list). Machine Learning is used to develop a machine learning model and register the model in the Machine Learning registry.
3. Azure Data Lake Storage provides storage for the machine learning data and a
cache for training the machine learning model.
4. The Web Apps feature of Azure App Service is used to create and deploy scalable
business-critical web applications. Power BI provides an interactive dashboard with
visualizations that use data that's stored in Azure Synapse Analytics to drive
decisions on the predictions.
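
The enrichment step can be illustrated with a minimal sentiment analysis call that uses the Azure AI Language (Text Analytics) client library; the endpoint, key, and sample posts are placeholders.

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Placeholder Language resource endpoint and key.
client = TextAnalyticsClient(
    endpoint="https://<language-resource>.cognitiveservices.azure.com",
    credential=AzureKeyCredential("<language-key>"),
)

posts = [
    "Loving the new release, setup took two minutes!",
    "Support has not replied for three days. Very disappointed.",
]

# Each result carries an overall sentiment plus per-class confidence scores.
for doc in client.analyze_sentiment(posts):
    print(doc.sentiment,
          doc.confidence_scores.positive,
          doc.confidence_scores.negative)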

Components
Azure Synapse Analytics is an integrated analytics service that accelerates time
to insight across data warehouses and big data systems.

Cognitive Services consists of cloud-based services that provide AI functionality. The REST APIs and client library SDKs help you build cognitive intelligence into apps even if you don't have AI or data science skills.

Machine Learning is a cloud-based environment that you can use to train, deploy, automate, manage, and track machine learning models.

Data Lake Storage is a massively scalable and secure data lake for high-
performance analytics workloads.

App Service provides a framework for building, deploying, and scaling web apps.
The Web Apps feature is a service for hosting web applications, REST APIs, and
mobile back ends.

Power BI is a collection of analytics services and apps. You can use Power BI to
connect and display unrelated sources of data.

Scenario details
Marketing campaigns are about more than the message that you deliver. When and
how you deliver that message is just as important. Without a data-driven, analytical
approach, campaigns can easily miss opportunities or struggle to gain traction.

These days, marketing campaigns are often based on social media analysis, which has
become increasingly important for companies and organizations around the world.
Social media analysis is a powerful tool that you can use to receive instant feedback on
products and services, improve interactions with customers to increase customer
satisfaction, keep up with the competition, and more. Companies often lack efficient,
viable ways to monitor social media conversations. As a result, they miss countless
opportunities to use these insights to inform their strategies and plans.

Potential use cases


If you can extract information about your customers from social media, you can enhance
customer experiences, increase customer satisfaction, gain new leads, and prevent
customer churn. These applications of social media analytics fall into three main areas:

Measuring brand health:


Capturing customer reactions and feedback for new products on social media.
Analyzing sentiment on social media interactions for a newly introduced
product.

Building and maintaining customer relationships:


Quickly identifying customer concerns.
Listening to untagged brand mentions.

Optimizing marketing investments:


Extracting insights from social media for campaign analysis.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Christina Skarpathiotaki | AI Cloud Solution Architect

Other contributors:

Nicholas Moore | Cloud Architecture / Data / Artificial Intelligence

Next steps
Learn more with the following learning paths:

Create machine learning models


Build AI solutions with Azure Machine Learning
Data integration at scale with Azure Data Factory or Azure Synapse Pipeline
Sentiment Analysis with Cognitive Services in Azure Synapse Analytics
Text Analytics with Cognitive Services in Azure Synapse Analytics
For information about solution components, see these resources:

Azure Machine Learning documentation


Azure Synapse Analytics documentation
Cognitive Services Documentation
Power BI documentation
App Service overview
Train machine learning models in Azure Synapse Analytics
Machine learning model scoring for dedicated SQL pools in Azure Synapse
Analytics
Machine learning with Apache Spark in Azure Synapse Analytics

Related resources
Face recognition and sentiment analysis
Customer churn prediction using real-time analytics
Create personalized marketing solutions in near real time
Azure Cosmos DB Azure Event Hubs Azure Functions Azure Machine Learning Azure Stream Analytics

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This architecture shows how you can create a solution personalizing offers with Azure
Functions, Azure Machine Learning, and Azure Stream Analytics.

Architecture

(Architecture diagram: input events from the browser flow through Event Hubs into Azure Stream Analytics for near real-time aggregates; Machine Learning supplies product affinity, cold-start product affinity feeds the personalized-offer logic, raw stream data is archived, and aggregated data lands in Azure Cosmos DB with a dashboard built on Azure services.)

Download a Visio file of this architecture.


Dataflow
Event Hubs ingests raw click-stream data from Azure Functions and passes it on
to Stream Analytics.
Azure Stream Analytics aggregates clicks in near real time by product, offer, and user. It writes to Azure Cosmos DB and also archives raw click-stream data to Azure Storage.
Azure Cosmos DB stores aggregated data of clicks by user, product, and offers
user-profile information.
Azure Storage stores archived raw click-stream data from Stream Analytics.
Azure Functions takes in user clickstream data from websites and reads existing user history from Azure Cosmos DB. This data is then run through the Machine Learning web service or used along with the cold-start data in Azure Cache for Redis to obtain product-affinity scores (a minimal cold-start lookup is sketched after this list). Product-affinity scores are used with the personalized-offer logic to determine the most relevant offer to present to the user.
Azure Machine Learning helps you design, test, operationalize, and manage
predictive analytics solutions in the cloud.
Azure Cache for Redis stores pre-computed cold-start product affinity scores for
users without history.
Power BI enables visualization of user activity data and offers presented by
reading in data from Azure Cosmos DB.
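
A minimal sketch of the cold-start lookup in the personalized-offer logic follows. The cache host, key format, and JSON payload are assumptions for illustration; the deployed solution defines its own schema.

import json
import redis

# Placeholder Azure Cache for Redis connection details.
cache = redis.Redis(
    host="<cache-name>.redis.cache.windows.net",
    port=6380,
    password="<access-key>",
    ssl=True,
)

def get_cold_start_affinities(user_id: str) -> dict:
    # Pre-computed product-affinity scores for users without history,
    # stored under a hypothetical "affinity:coldstart:<user_id>" key.
    payload = cache.get(f"affinity:coldstart:{user_id}")
    return json.loads(payload) if payload else {}

# Example: pick the highest-affinity product for a first-time visitor.
affinities = get_cold_start_affinities("user-123")
if affinities:
    print("Recommend:", max(affinities, key=affinities.get))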

Components
Event Hubs
Azure Stream Analytics
Azure Cosmos DB
Azure Storage
Azure Functions
Azure Machine Learning
Azure Cache for Redis
Power BI

Scenario details
Personalized marketing is essential for building customer loyalty and remaining
profitable. Reaching customers and getting them to engage is harder than ever, and
generic offers are easily missed or ignored. Current marketing systems fail to take
advantage of data that can help solve this problem.
Marketers using intelligent systems and analyzing massive amounts of data can deliver
highly relevant and personalized offers to each user, cutting through the clutter and
driving engagement. For example, retailers can provide offers and content based on
each customer's unique interests, preferences and product affinity, putting products in
front of the people most likely to buy them.

This architecture shows how you can create a solution personalizing offers with Azure
Functions, Azure Machine Learning, and Azure Stream Analytics.

Potential use cases


By personalizing your offers, you'll deliver an individualized experience for current and
prospective customers, boosting engagement and improving customer conversion,
lifetime value, and retention.

This solution is ideal for the retail and marketing industries.

Next steps
See the product documentation:

Learn more about Event Hubs


Learn more about Stream Analytics
Learn how to use Azure Cosmos DB
Learn more about Azure Storage
Learn how to create functions
Learn more about machine learning
Learn how to use Azure Cache for Redis
Learn about Power BI

Try a learning path:

Implement a Data Streaming Solution with Azure Streaming Analytics


Build a Machine Learning model
Create serverless apps with Azure Functions

Related resources
Read other Azure Architecture Center articles:

Big data architecture style


Scalable personalization on Azure
Search and query an enterprise knowledge base by using Azure OpenAI or Azure Cognitive Search
Azure Blob Storage Azure Cache for Redis Azure Cognitive Search Azure AI services

Azure Document Intelligence

This article describes how to use Azure OpenAI Service or Azure Cognitive Search to
search documents in your enterprise data and retrieve results to provide a ChatGPT-
style question and answer experience. This solution describes two approaches:

Embeddings approach: Use the Azure OpenAI embedding model to create vectorized data. Vector search is a feature that significantly increases the semantic relevance of search results.

Azure Cognitive Search approach: Use Azure Cognitive Search to search and
retrieve relevant text data based on a user query. This service supports full-text
search, semantic search, vector search, and hybrid search.

Note

In Azure Cognitive Search, the semantic search and vector search features are
currently in public preview.

Architecture: Embedding approach


(Architecture diagram: in embedding creation, documents in storage accounts are processed by function apps that extract text, optionally translate it with Azure Translator, use Azure AI Document Intelligence, and create embeddings with the Azure OpenAI embedding model, which are persisted in Azure Cache for Redis; in query and retrieval, the user's query from Azure App Service is vectorized with the embedding model, the top k matching content is returned, and the results are passed with the prompt to the Azure OpenAI language model.)


Download a Visio file of this architecture.
Dataflow
Documents to be ingested can come from various sources, like files on an FTP server,
email attachments, or web application attachments. These documents can be ingested
to Azure Blob Storage via services like Azure Logic Apps, Azure Functions, or Azure Data
Factory. Data Factory is optimal for transferring bulk data.

Embedding creation:

1. The document is ingested into Blob Storage, and an Azure function is triggered to
extract text from the documents.

2. If documents are in a non-English language and translation is required, an Azure function can call Azure Translator to perform the translation.

3. If the documents are PDFs or images, an Azure function can call Azure AI
Document Intelligence to extract the text. If the document is an Excel, CSV, Word,
or text file, Python code can be used to extract the text.

4. The extracted text is then chunked appropriately, and an Azure OpenAI embedding
model is used to convert each chunk to embeddings.

5. These embeddings are persisted to the vector database. This solution uses the
Enterprise tier of Azure Cache for Redis, but any vector database can be used.

Query and retrieval:

1. The user sends a query via a user application.

2. The Azure OpenAI embedding model is used to convert the query into vector
embeddings.

3. A vector similarity search that uses this query vector in the vector database returns
the top k matching content. The matching content to be retrieved can be set
according to a threshold that’s defined by a similarity measure, like cosine
similarity.

4. The top k retrieved content and the system prompt are sent to the Azure OpenAI language model, like GPT-3.5 Turbo or GPT-4 (a minimal version of this flow is sketched after this list).

5. The search results are presented as the answer to the search query that was
initiated by the user, or the search results can be used as the grounding data for a
multi-turn conversation scenario.
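
The following condensed Python sketch shows that embed, retrieve, and generate flow by using the Azure OpenAI client library. The deployment names, API version, and tiny in-memory chunk list are placeholders; the architecture described here keeps the embeddings in Azure Cache for Redis rather than in memory.

import numpy as np
from openai import AzureOpenAI

# Placeholder Azure OpenAI endpoint, key, API version, and deployment names.
client = AzureOpenAI(
    azure_endpoint="https://<openai-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-02-01",
)

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(model="<embedding-deployment>", input=text)
    return np.array(response.data[0].embedding)

# Assume these chunks were extracted and embedded during ingestion.
chunks = ["Claims must be filed within 30 days.", "Premiums are billed monthly."]
chunk_vectors = [embed(chunk) for chunk in chunks]

query = "When do I need to file a claim?"
query_vector = embed(query)

# Cosine similarity against each stored chunk; keep the best match (k = 1 here).
scores = [vec @ query_vector / (np.linalg.norm(vec) * np.linalg.norm(query_vector))
          for vec in chunk_vectors]
top_chunk = chunks[int(np.argmax(scores))]

# Ground the language model with the retrieved content and the user question.
answer = client.chat.completions.create(
    model="<chat-deployment>",
    messages=[
        {"role": "system", "content": f"Answer using only this context: {top_chunk}"},
        {"role": "user", "content": query},
    ],
)
print(answer.choices[0].message.content)
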
Architecture: Azure Cognitive Search pull approach

(Architecture diagram: in index creation, documents in storage accounts are crawled by Azure Cognitive Search via the pull API, with optional AI enrichment skillsets that use Azure AI Translator and Azure AI Document Intelligence; in query and retrieval, the user's query from Azure App Service retrieves relevant content, a system prompt is created, and the Azure OpenAI language model is called.)


Download a Visio file of this architecture.

Index creation:

1. Azure Cognitive Search is used to create a search index of the documents in Blob
Storage. Azure Cognitive Search supports Blob Storage, so the pull model is used
to crawl the content, and the capability is implemented via indexers.

Note

Azure Cognitive Search supports other data sources for indexing when using
the pull model. Documents can also be indexed from multiple data sources
and consolidated into a single index.

2. If certain scenarios require translation of documents, Azure Translator can be used, which is a feature that's included in the built-in skill.

3. If the documents are nonsearchable, like scanned PDFs or images, AI can be applied by using built-in or custom skills as skillsets in Azure Cognitive Search.
Applying AI over content that isn't full-text searchable is called AI enrichment.
Depending on the requirement, Azure AI Document Intelligence can be used as a
custom skill to extract text from PDFs or images via document analysis models,
prebuilt models, or custom extraction models.

If AI enrichment is a requirement, the pull model (indexers) must be used to load an index.

If vector fields are added to the index schema and vector data is loaded for indexing, vector search can be enabled on that data. The vector data can be generated via Azure OpenAI embeddings.
Query and retrieval:

1. A user sends a query via a user application.

2. The query is passed to Azure Cognitive Search via the search documents REST API.
The query type can be simple, which is optimal for full-text search, or full, which is
for advanced query constructs like regular expressions, fuzzy and wild card search,
and proximity search. If the query type is set to semantic, a semantic search is
performed on the documents, and the relevant content is retrieved. Azure
Cognitive Search also supports vector search and hybrid search, which requires the
user query to be converted to vector embeddings.

3. The retrieved content and the system prompt are sent to the Azure OpenAI
language model, like GPT-3.5 Turbo or GPT-4.

4. The search results are presented as the answer to the search query that was
initiated by the user, or the search results can be used as the grounding data for a
multi-turn conversation scenario.

Architecture: Azure Cognitive Search push approach
If the data source isn't supported, you can use the push model to upload the data to
Azure Cognitive Search.

(Architecture diagram: in index creation, files are optionally translated with Azure AI Translator and text is extracted with Azure AI Document Intelligence, then pushed from Azure App Service to Azure Cognitive Search via the push API; in query and retrieval, the user's query from Azure App Service retrieves content, a system prompt is created, and the Azure OpenAI language model is called.)


Download a Visio file of this architecture.

Index creation:

1. If the document to be ingested must be translated, Azure Translator can be used.


2. If the document is in a nonsearchable format, like a PDF or image, Azure AI
Document Intelligence can be used to extract text.
3. The extracted text can be vectorized via Azure OpenAI embeddings for vector search, and the data can be pushed to an Azure Cognitive Search index via a REST API or an Azure SDK (a minimal push call is sketched after this list).
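
A minimal sketch of that push call with the azure-search-documents client library follows; the index name and the id, content, and contentVector field names are assumptions, and the index schema must define matching (vector) fields.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholder search service endpoint, admin key, and index name.
search_client = SearchClient(
    endpoint="https://<search-service>.search.windows.net",
    index_name="enterprise-docs",
    credential=AzureKeyCredential("<admin-key>"),
)

documents = [
    {
        "id": "doc-001-chunk-01",
        "content": "Claims must be filed within 30 days.",
        # Embedding produced by the Azure OpenAI embedding model (truncated).
        "contentVector": [0.012, -0.034, 0.087],
    }
]

result = search_client.upload_documents(documents=documents)
print([item.succeeded for item in result])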

Query and retrieval:

The query and retrieval in this approach is the same as the pull approach earlier in this
article.

Components
Azure OpenAI provides REST API access to Azure OpenAI's language models
including the GPT-3, Codex, and the embedding model series for content
generation, summarization, semantic search, and natural language-to-code
translation. Access the service by using a REST API, Python SDK, or the web-based
interface in the Azure OpenAI Studio .

Azure AI Document Intelligence is an Azure AI service. It offers document analysis capabilities to extract printed and handwritten text, tables, and key-value
pairs. Azure AI Document Intelligence provides prebuilt models that can extract
data from invoices, documents, receipts, ID cards, and business cards. You can also
use it to train and deploy custom models by using a custom template form model
or a custom neural document model.

Document Intelligence Studio provides a UI for exploring Azure AI Document Intelligence features and models, and for building, tagging, training, and deploying custom models.

Azure Cognitive Search is a cloud service that provides infrastructure, APIs, and
tools for searching. Use Azure Cognitive Search to build search experiences over
private disparate content in web, mobile, and enterprise applications.

Blob Storage is the object storage solution for raw files in this scenario. Blob
Storage supports libraries for various languages, such as .NET, Node.js, and Python.
Applications can access files in Blob Storage via HTTP or HTTPS. Blob Storage
has hot, cool, and archive access tiers to support cost optimization for storing large
amounts of data.

The Enterprise tier of Azure Cache for Redis provides managed Redis Enterprise
modules, like RediSearch, RedisBloom, RedisTimeSeries, and RedisJSON. Vector
fields allow vector similarity search, which supports real-time vector indexing
(brute force algorithm (FLAT) and hierarchical navigable small world algorithm
(HNSW)), real-time vector updates, and k-nearest neighbor search. Azure Cache for
Redis brings a critical low-latency and high-throughput data storage solution to
modern applications.
Alternatives
Depending on your scenario, you can add the following workflows.

Use the Azure AI Language features, question answering and conversational language understanding, to build a natural conversational layer over your data. These features find appropriate answers for the input from your custom knowledge base of information.

To create vectorized data, you can use any embedding model. You can also use the
Azure AI services Vision image retrieval API to vectorize images. This tool is
available in private preview.

Use the Durable Functions extension for Azure Functions as a code-first integration tool to perform text-processing steps, like reading handwriting, text, and tables, and processing language to extract entities from data, based on the size and scale of the workload.

You can use any database for persistent storage of the extracted embeddings,
including:
Azure SQL Database
Azure Cosmos DB
Azure Database for PostgreSQL
Azure Database for MySQL

Scenario details
Manual processing is increasingly time-consuming, error-prone, and resource-intensive
due to the sheer volume of documents. Organizations that handle huge volumes of
documents, largely unstructured data of different formats like PDF, Excel, CSV, Word,
PowerPoint, and image formats, face a significant challenge processing scanned and
handwritten documents and forms from their customers.

These documents and forms contain critical information, such as personal details,
medical history, and damage assessment reports, which must be accurately extracted
and processed.

Organizations often already have their own knowledge base of information, which can
be used for answering questions with the most appropriate answer. You can use the
services and pipelines described in these solutions to create a source for search
mechanisms of documents.
Potential use cases
This solution provides value to organizations in industries like pharmaceutical
companies and financial services. It applies to any company that has a large number of
documents with embedded information. This AI-powered end-to-end search solution
can be used to extract meaningful information from the documents based on the user
query to provide a ChatGPT-style question and answer experience.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Dixit Arora | Senior Customer Engineer, ISV DN CoE


Jyotsna Ravi | Principal Customer Engineer, ISV DN CoE

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
What is Azure AI Document Intelligence?
What is Azure OpenAI?
What is Azure Machine Learning?
Introduction to Blob Storage
What is Azure AI Language?
Introduction to Azure Data Lake Storage Gen2
Azure QnA Maker client library
Create, train, and publish your QnA Maker knowledge base
What is question answering?

Related resources
Query-based document summarization
Automate document identification, classification, and search by using Durable
Functions
Index file content and metadata by using Azure Cognitive Search
AI enrichment with image and text processing
AI at the edge with Azure Stack Hub
Azure Container Registry Azure Kubernetes Service (AKS) Azure Machine Learning Azure Stack Hub

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This architecture shows how you can bring your trained AI model to the edge with Azure
Stack Hub and integrate it with your applications for low-latency intelligence.

Architecture

Download a Visio file of this architecture.

Dataflow
1. Data is processed using Azure Data Factory, to be placed on Azure Data Lake.
2. Data from Azure Data Factory is placed into the Azure Data Lake Storage for
training.
3. Data scientists train a model using Azure Machine Learning. The model is
containerized and put into an Azure Container Registry.
4. The model is deployed to a Kubernetes cluster on Azure Stack Hub.
5. The on-premises web application can be used to score data that's provided by the
end user, to score against the model that's deployed in the Kubernetes cluster.
6. End users provide data that's scored against the model.
7. Insights and anomalies from scoring are placed into a queue.
8. A function app is triggered when scoring information is placed in the queue.
9. A function sends compliant data and anomalies to Azure Storage (a minimal queue-triggered function is sketched after this list).
10. Globally relevant and compliant insights are available for consumption in Power BI
and a global app.
11. Feedback loop: The model retraining can be triggered by a schedule. Data
scientists work on the optimization. The improved model is deployed and
containerized as an update to the container registry.
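
Steps 7 through 9 can be illustrated with a minimal queue-triggered Azure Function in Python. The queue name, connection string, container name, and message shape are placeholder assumptions, and the queue binding itself is configured in function.json.

import json

import azure.functions as func
from azure.storage.blob import BlobServiceClient

def main(msg: func.QueueMessage) -> None:
    # Parse the scoring result placed on the queue by the scoring service.
    result = json.loads(msg.get_body().decode("utf-8"))

    # Forward only compliant records and flagged anomalies to Azure Storage.
    if result.get("compliant") or result.get("anomaly"):
        blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
        container = blob_service.get_container_client("insights")
        container.upload_blob(name=f"{msg.id}.json", data=json.dumps(result))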

Components
Key technologies used to implement this architecture:

Azure Machine Learning: Build, deploy, and manage predictive analytics solutions.
Azure Data Factory : Ingest data into Azure Data Factory.
Azure Data Lake Storage : Load data into Azure Data Lake Storage Gen2 with
Azure Data Factory.
Container Registry : Store and manage container images across all types of Azure
deployments.
Azure Kubernetes Service (AKS) : Simplify the deployment, management, and
operations of Kubernetes.
Azure Storage : Durable, highly available, and massively scalable cloud storage.
Azure Stack Hub : Build and run innovative hybrid applications across cloud
boundaries.
Azure Functions : Event-driven serverless compute unit for on-demand tasks
running without the needs of maintaining the computing server.
Azure App Service: Path that captures end-user feedback data to enable model
optimization.

Scenario details
With the Azure AI tools and the edge and cloud platform, edge intelligence is possible. The next generation of AI-enabled hybrid applications can run where your data lives. With Azure Stack Hub, you can bring a trained AI model to the edge, integrate it with your applications for low-latency intelligence, and continuously feed results back into a refined AI model for improved accuracy, with no tool or process changes for local applications.
This solution idea shows a connected Stack Hub scenario, where edge applications are
connected to Azure. For the disconnected-edge version of this scenario, see the article
AI at the edge - disconnected.

Potential use cases


There's a wide range of Edge AI applications that monitor and provide information in
near real-time. Areas where Edge AI can help include:

Security camera detection processes.


Image and video analysis (the media and entertainment industry).
Transportation and traffic (the automotive and mobility industry).
Manufacturing.
Energy (smart grids).

Next steps
Want to learn more? Check out the Introduction to Azure Stack module
Get Microsoft Certified for Azure Stack Hub with the Azure Stack Hub Operator
Associate certification
How to install the AKS Engine on Linux in Azure Stack Hub
How to install the AKS Engine on Windows in Azure Stack Hub
Deploy your ML models to an edge device with Azure Stack Edge Devices
Innovate further and deploy Azure Cognitive Services (Speech, Language, Decision,
Vision) containers to Azure Stack Hub

For more information about the featured Azure services, see the following articles and
samples:

App Service documentation


Azure Data Lake Storage Gen 2
Azure Kubernetes Service (AKS) documentation
Azure Machine Learning documentation
Azure Stack Hub documentation
Azure Stack Hub Deployment Options
Container Registry documentation
Storage documentation
AKS Engine on Azure Stack Hub (on GitHub)
Azure Samples - Edge Intelligence on Azure Stack Hub (on GitHub)
Azure Samples -Azure Stack Hub Foundation (on GitHub)
Azure hybrid and multicloud patterns and solutions documentation

Related resources
See the following related architectures:

Disconnected AI at the edge with Azure Stack Hub


Machine learning in Azure IoT Edge vision AI
Implement the Azure healthcare blueprint for AI
Deploy AI and machine learning computing on-premises and to the edge
AI-based footfall detection
Disconnected AI at the edge with Azure Stack Hub
Azure Container Registry Azure HDInsight Azure Kubernetes Service (AKS) Azure Machine Learning

Azure Stack Hub

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This article outlines a solution for using edge AI when you're disconnected from the
internet. The solution uses Azure Stack Hub to move AI models to the edge.

Apache®, Apache Hadoop , Apache Spark , Apache HBase , and Apache Storm are
either registered trademarks or trademarks of the Apache Software Foundation in the
United States and/or other countries. No endorsement by the Apache Software Foundation
is implied by the use of these marks.

Architecture

Download a Visio file of this architecture.

Dataflow
1. Data scientists use Azure Machine Learning and an Azure HDInsight cluster to train
a machine learning model. The model is containerized and put into Azure
Container Registry.
2. The model is deployed to an Azure Kubernetes Service (AKS) cluster on Azure
Stack Hub.
3. End users provide data that's scored against the model.
4. Insights and anomalies from scoring are placed into storage for upload later.
5. Globally relevant and compliant insights are available in a global app.
6. Data scientists use scoring from the edge to improve the model.

Components
Machine Learning is a cloud-based environment that you can use to build,
deploy, and manage machine learning models. With these models, you can
forecast future behavior, outcomes, and trends.
HDInsight is a managed, full-spectrum, open-source analytics service in the
cloud for enterprises. You can use open-source frameworks with HDInsight, such as
Hadoop, Spark, HBase, and Storm.
Container Registry is a service that creates a managed registry of container
images. You can use Container Registry to build, store, and manage the images.
You can also use it to store containerized machine learning models.
AKS is a highly available, secure, and fully managed Kubernetes service. AKS
makes it easy to deploy and manage containerized applications.
Azure Virtual Machines is an infrastructure-as-a-service (IaaS) offer. You can use
Virtual Machines to deploy on-demand, scalable computing resources like
Windows and Linux virtual machines.
Azure Storage offers highly available, scalable, secure cloud storage for data,
applications, and workloads.
Azure Stack Hub is an extension of Azure that provides a way to run apps in an
on-premises environment and deliver Azure services to your datacenter.

Scenario details
With Azure AI tools and the Azure edge and cloud platform, edge intelligence is
possible. AI-enabled hybrid applications can run where your data lives, on-premises. By
using Azure Stack Hub, you can bring a trained AI model to the edge and integrate it
with your applications for low-latency intelligence. With this approach, you don't need
to make changes in tools or processes for local applications. When you use Azure Stack
Hub, you can ensure that your cloud solutions work even when you're disconnected
from the internet.

This solution is for a disconnected Azure Stack Hub scenario. Because of latency or
intermittent connectivity issues or regulations, you might not always be connected to
Azure. In disconnected scenarios, you can process data locally and aggregate it later in
Azure for further analysis. For the connected version of this scenario, see AI at the edge.

Potential use cases


You might need to deploy in a disconnected state in the following scenarios:

You have security or other restrictions that require you to deploy Azure Stack Hub
in an environment that isn't connected to the internet.
You want to block data (including usage data) from being sent to Azure.
You want to use Azure Stack Hub purely as a private cloud solution that's deployed
to your corporate intranet, and you aren't interested in hybrid scenarios.

Next steps
For more information about Azure Stack solutions, see the following resources:

Training module: Introduction to Azure Stack


Microsoft Certified: Azure Stack Hub Operator Associate
Install the AKS engine on Linux in Azure Stack Hub
Install the AKS engine on Windows in Azure Stack Hub
Azure Stack Edge managed devices that bring Azure AI to the edge
Use Azure Cognitive Services containers to make Azure APIs available on-premises

For more information about solution components, see the following product
documentation:

Azure App Service


AKS
Machine Learning
Azure Stack Hub documentation
Azure Stack Hub deployment options
Container Registry
HDInsight
Storage
Virtual machines in Azure
Azure hybrid and multicloud patterns and solutions documentation

For samples, see the following resources:

AKS engine on Azure Stack Hub (on GitHub)


Azure samples - edge intelligence on Azure Stack Hub (on GitHub)
Azure samples - Azure Stack Hub foundation (on GitHub)

Related resources
For related solutions, see the following articles:

AI at the edge with Azure Stack Hub


AI-based footfall detection
Deploy AI and machine learning computing on-premises and to the edge
Azure public multi-access edge compute deployment
Choose a bare-metal Kubernetes at the edge platform option
Video ingestion and object detection on the edge and in the cloud
Azure Stack Edge Azure Kubernetes Service (AKS) Azure SQL Edge Azure Container Registry

This article describes how to use a mobile robot with a live streaming camera to implement various use cases. The solution implements a system that runs locally on Azure Stack Edge to ingest and process the video stream, together with Azure AI services that perform object detection.

Architecture
(Architecture diagram: ingestion and processing, object detection, anomaly detection, and visualization stages. The video ingest and process container writes key frames to a storage account; the AI Vision container sends them to Azure AI services; the anomaly detection container stores results in an Azure SQL Edge database; and a browser visualizes alerts. Supporting services include Container Registry, Key Vault, Azure Arc, Azure Monitor, Azure Kubernetes Service, and Azure Stack Edge.)

Download a Visio file of this architecture.

Workflow
This workflow describes how the system processes the incoming data:

1. A camera that's installed on the robot streams video in real time by using Real
Time Streaming Protocol (RTSP).

2. A container in the Kubernetes cluster on Azure Stack Edge reads the incoming
stream and splits video into separate images. An open-source software tool called
FFmpeg ingests and processes the video stream.

3. Images are stored in the local Azure Stack Edge storage account.

4. Each time a new key frame is saved in the storage account, an AI Vision container
picks it up. For information about the separation of logic into multiple containers,
see Scenario details.

5. When it loads a key frame from the storage container, the AI Vision container sends it to Azure AI services in the cloud. This architecture uses Azure AI Vision, which enables object detection via image analysis (a minimal detection call is sketched after this workflow).

6. The results of image analysis (detected objects and a confidence rating) are sent to
the anomaly detection container.

7. The anomaly detection container stores the results of image analysis and anomaly
detection in the local Azure Stack Edge Azure SQL database for future reference.
Using a local instance of the database improves access time, which helps to
minimize delays in data access.

8. Data processing is run to detect any anomalies in the incoming real-time video
stream. If anomalies are detected, a front-end UI shows an alert.
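
Steps 5 and 6 can be sketched with a minimal object detection call that uses the Computer Vision client library; the endpoint, key, frame file name, and confidence threshold are placeholders.

from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

# Placeholder Azure AI Vision endpoint and key.
client = ComputerVisionClient(
    "https://<vision-resource>.cognitiveservices.azure.com",
    CognitiveServicesCredentials("<vision-key>"),
)

# Send one key frame saved by the ingestion container to the cloud service.
with open("keyframe_0001.jpg", "rb") as frame:
    detection = client.detect_objects_in_stream(frame)

# Forward detected objects and confidence ratings to the anomaly-detection step.
for obj in detection.objects:
    if obj.confidence >= 0.5:
        print(obj.object_property, obj.confidence,
              obj.rectangle.x, obj.rectangle.y, obj.rectangle.w, obj.rectangle.h)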

Components
Azure Stack Edge is used to host running Azure services on-premises, close to
the location where anomaly detection occurs, which reduces latency.

Azure Kubernetes Service on Azure Stack Edge is used to run a Kubernetes cluster
of containers that contain the system's logic on Azure Stack Edge in a simple and
managed way.

Azure Arc controls the Kubernetes cluster that runs on the edge device.

Azure AI Vision is used to detect objects in key frames of the video stream.

Azure Blob Storage is used to store images of key frames that are extracted from
the video stream.
Azure SQL Edge is used to store data on the edge, close to the service that
consumes and processes it.

Azure Container Registry is used to store Docker container images.

Azure Key Vault provides enhanced-security storage for any secrets or cryptographic keys that are used by the system.

Azure Monitor provides observability for the system.

Scenario details
This architecture demonstrates a system that processes a real-time video stream,
compares the extracted real-time data with a set of reference data, and makes decisions
based on the results. For example, it could be used to provide scheduled inspections of
a fenced perimeter around a secured location.

The architecture uses Stack Edge to ensure that the most resource-intensive processes
are performed on-premises, close to the source of the video. This design significantly
improves the response time of the system, which is important when an immediate
response to an anomaly is critical.

Because the parts of the system are deployed as independent containers in a Kubernetes cluster, you can scale only the required subsystems according to demand.
For example, if you increase the number of cameras for the video feed, you can scale the
container that's responsible for video ingestion and processing to handle the demand
but keep the rest of the cluster at the original level.

Offloading the object detection functionality to Azure AI services significantly reduces the expertise that you need to deploy this architecture.
object detection are highly specialized, the out-of-the-box approach you get from the
Image Analysis service is sufficient and doesn't require knowledge of machine learning.

Potential use cases


Monitoring the security of a perimeter

Detecting an unsafe working environment in a factory

Detecting anomalies in an automated assembly line

Detecting a lack of de-icing fluid on aircraft


Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that you can use to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Reliability
Reliability ensures your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.

One of the biggest advantages of using Azure Stack Edge is that you get fully managed
components on your on-premises hardware. All fully managed Azure components are
automatically resilient at a regional level.

In addition, running the system in a Kubernetes cluster enables you to offload the
responsibility for keeping the subsystems healthy to the Kubernetes orchestration
system.

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

Microsoft Entra managed identities provide security for all components of this
architecture. Using managed identities eliminates the need to store secrets in code or
configuration files. It simplifies access control, credential management, and role
assignment.

Cost optimization
Cost optimization is about reducing unnecessary expenses and improving operational
efficiencies. For more information, see Overview of the cost optimization pillar.

To see a pricing example for this scenario, use the Azure pricing calculator . The most
expensive components in the scenario are Azure Stack Edge and Azure Kubernetes
Service. These services provide capacity for scaling the system to address increased
demand in the future.

The cost of using Azure AI services for object detection varies based on how long the
system runs. The preceding pricing example is based on a system that produces one
image per second and operates for 8 hours per day. One FPS is sufficient for this
scenario. However, if your system needs to run for longer periods of time, the cost of
using Azure AI services is higher:

Medium workload: 12 hours per day.

High workload: 24 hours per day.

Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands
placed on it by users in an efficient manner. For more information, see Performance
efficiency pillar overview.

Because the code is deployed in a Kubernetes cluster, you can take advantage of the
benefits of this powerful orchestration system. Because the various subsystems are
separated into containers, you can scale only the most demanding parts of the
application. At a basic level, with one incoming video feed, the system can contain just
one node in a cluster. This design significantly simplifies the initial configuration. As
demand for data processing grows, you can easily scale the cluster by adding nodes.
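As a minimal sketch of scaling just one subsystem, the following snippet uses the Kubernetes Python client to raise the replica count of a single deployment; the deployment and namespace names are hypothetical. Adding nodes to the cluster itself is a separate operation, for example through AKS node pools.

Python
from kubernetes import client, config

# Load credentials from the local kubeconfig; use load_incluster_config()
# when this code runs inside the cluster.
config.load_kube_config()

apps = client.AppsV1Api()

# Hypothetical names: scale only the video-ingestion deployment to 3 replicas
# and leave every other subsystem in the cluster untouched.
apps.patch_namespaced_deployment_scale(
    name="video-ingestion",
    namespace="anomaly-detection",
    body={"spec": {"replicas": 3}},
)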

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Nick Sologoub | Principal Software Engineering Lead

Other contributors:

Mick Alberts | Technical Writer


Frédéric Le Coquil | Principal Software Engineer

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Product documentation:

Object detection
Responsible use of AI
What is Azure Stack Edge Pro 2?
Azure Kubernetes Service
Azure Arc overview

Guided learning path:

Bring Azure innovation to your hybrid environments with Azure Arc


Introduction to Azure Kubernetes Service
Introduction to Azure Stack
Analyze images with the Computer Vision service

Related resources
Image classification on Azure
AI enrichment with image and text processing
Azure App Service Azure Blob Storage Azure Cognitive Search Azure Functions

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This article presents a solution that enriches text and image documents by using image
processing, natural language processing, and custom skills to capture domain-specific
data. Azure Cognitive Search with AI enrichment can help identify and explore relevant
content at scale. This solution uses AI enrichment to extract meaning from the original
complex, unstructured JFK Assassination Records (JFK Files) dataset.

Architecture

Architecture diagram showing unstructured documents and images in Blob Storage flowing through Azure Cognitive Search (ingestion, document cracking, enrichment with built-in Computer Vision, Text Analytics, and Translator skills and with custom skills backed by Azure Functions, Form Recognizer, and Azure Machine Learning, then indexing) into a search index that a web application queries and into a knowledge store projected to Blob Storage and Table Storage.

Download a Visio file of this architecture.

Dataflow
The above diagram illustrates the process of passing the unstructured JFK Files dataset
through the Azure Cognitive Search skills pipeline to produce structured, indexable data:

1. Unstructured data in Azure Blob Storage, such as documents and images, ingest
into Azure Cognitive Search.
2. The document cracking step initiates the indexing process by extracting images
and text from the data, followed by content enrichment. The enrichment steps that
occur in this process depend on the data and type of skills selected.
3. Built-in skills based on the Computer Vision and Language Service APIs enable AI
enrichments including image optical character recognition (OCR), image analysis,
text translation, entity recognition, and full-text search.
4. Custom skills support scenarios that require more complex AI models or services.
Examples include Form Recognizer, Azure Machine Learning models, and Azure
Functions.
5. Following the enrichment process, the indexer saves the outputs into a search
index that contains the enriched and indexed documents. Full-text search and
other query forms can use this index.
6. The enriched documents can also project into a knowledge store, which
downstream apps like knowledge mining or data science can use.
7. Queries access the enriched content in the search index. The index supports
custom analyzers, fuzzy search queries, filters, and a scoring profile to tune search
relevance.
8. Any application that connects to Blob Storage or to Azure Table Storage can access
the knowledge store.

Components
Azure Cognitive Search works with other Azure components to provide this solution.

Azure Cognitive Search

Azure Cognitive Search indexes the content and powers the user experience in this
solution. Azure Cognitive Search can apply pre-built cognitive skills to the content, and
the extensibility mechanism can add custom skills for specific enrichment
transformations.

Azure Computer Vision


Azure Computer Vision uses text recognition to extract and recognize text information
from images. The Read API uses the latest OCR recognition models, and is optimized for
large, text-heavy documents and noisy images.
The legacy OCR API isn't optimized for large documents, but supports more
languages. OCR results can vary depending on scan and image quality. The current
solution idea uses OCR to produce data in the hOCR format .

Azure Cognitive Service for Language


Azure Cognitive Service for Language extracts text information from unstructured
documents by using text analytics capabilities like Named Entity Recognition (NER), key
phrase extraction, and full-text search.

Azure Storage

Azure Blob Storage is REST-based object storage for data that you can access from
anywhere in the world via HTTPS. You can use Blob Storage to expose data publicly to
the world or to store application data privately. Blob Storage is ideal for large amounts
of unstructured data like text or graphics.

Azure Table Storage stores highly available, scalable, structured or semi-structured NoSQL data in the cloud.

Azure Functions

Azure Functions is a serverless compute service that lets you run small pieces of
event-triggered code without having to explicitly provision or manage infrastructure.
This solution uses an Azure Functions method to apply the CIA Cryptonyms list to the
JFK Assassination Records as a custom skill.
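A custom skill is a web API that receives and returns records wrapped in a values array keyed by recordId. The following minimal sketch shows the shape of such a skill as a Python Azure Function; the cryptonym lookup table is a stand-in for illustration, not the actual list used by the JFK Files project.

Python
import json
import azure.functions as func

# Hypothetical stand-in for the real cryptonym list used by the project.
CRYPTONYMS = {"GPFLOOR": "Lee Harvey Oswald", "ODYOKE": "U.S. federal government"}

def main(req: func.HttpRequest) -> func.HttpResponse:
    """Custom skill: receive records in the enrichment pipeline format and
    return the cryptonyms found in each record's text."""
    body = req.get_json()
    results = []
    for record in body.get("values", []):
        text = (record.get("data", {}).get("text") or "").upper()
        found = {c: m for c, m in CRYPTONYMS.items() if c in text}
        results.append({
            "recordId": record["recordId"],  # must echo the incoming recordId
            "data": {"cryptonyms": found},
            "errors": None,
            "warnings": None,
        })
    return func.HttpResponse(
        json.dumps({"values": results}), mimetype="application/json"
    )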

Azure App Service


This solution idea also builds a standalone web app in Azure App Service to test,
demonstrate, search the index, and explore connections in the enriched and indexed
documents.

Scenario details
Large, unstructured datasets can include typewritten and handwritten notes, photos and
diagrams, and other unstructured data that standard search solutions can't parse. The
JFK Assassination Records contain over 34,000 pages of documents about the CIA
investigation of the 1963 JFK assassination.
The JFK Files sample project and online demo showcase a particular Azure
Cognitive Search use case. This solution idea isn't intended to be a framework or
scalable architecture for all scenarios, but to provide a general guideline and example.
The code project and demo create a public website and publicly readable storage
container for extracted images, so you shouldn't use this solution with non-public data.

AI enrichment in Azure Cognitive Search can extract and enhance searchable, indexable
text from images, blobs, and other unstructured data sources like the JFK Files. AI
enrichment uses pre-trained machine learning skill sets from the Cognitive Services
Computer Vision and Cognitive Service for Language APIs. You can also create and
attach custom skills to add special processing for domain-specific data like CIA
Cryptonyms. Azure Cognitive Search can then index and search that context.

The Azure Cognitive Search skills in this solution fall into the following categories:

Image processing. Built-in text extraction and image analysis skills include object
and face detection, tag and caption generation, and celebrity and landmark
identification. These skills create text representations of image content, which are
searchable by using the query capabilities of Azure Cognitive Search. Document
cracking is the process of extracting or creating text content from non-text sources.

Natural language processing. Built-in skills like entity recognition, language detection, and key phrase extraction map unstructured text to searchable and filterable fields in an index.

Custom skills extend Azure Cognitive Search to apply specific enrichment transformations to content. You specify the interface for a custom skill through the Custom Web API skill.

Potential use cases


Increase the value and utility of unstructured text and image content in search and
data science apps.
Use custom skills to integrate open-source, third-party, or first-party code into
indexing pipelines.
Make scanned JPG, PNG, or bitmap documents full-text searchable.
Produce better outcomes than standard PDF text extraction for PDFs with
combined image and text. Some scanned and native PDF formats might not parse
correctly in Azure Cognitive Search.
Create new information from inherently meaningful raw content or context that's
hidden in larger unstructured or semi-structured documents.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributor.

Principal author:

Carlos Alexandre Santos | Senior Specialized AI Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Learn more about this solution:

Explore the JFK Files project on GitHub.


Watch the process in action in an online video.
Explore the JFK Files online demo .

Read product documentation:

AI enrichment in Azure Cognitive Search


What is Computer Vision?
What is Azure Cognitive Service for Language?
What is optical character recognition?
What is Named Entity Recognition (NER) in Azure Cognitive Service for Language?
Introduction to Azure Blob Storage
Introduction to Azure Functions

Try the learning path:

Implement knowledge mining with Azure Cognitive Search

Related resources
See the related architectures and guidance:

Intelligent product search engine for e-commerce


Process free-form text for search
Keyword search and speech-to-text with OCR digital media
Suggest content tags with NLP using deep learning
Knowledge mining for content research
Deploy machine learning models to AKS with Kubeflow
Azure Blob Storage Azure Container Registry Azure Kubernetes Service (AKS)

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This article presents a solution for real-time inferencing on Azure Kubernetes Service
(AKS).

Architecture

Architecture diagram showing a data scientist publishing a containerized machine learning model through Azure Container Registry and Azure Blob Storage, Kubeflow deploying it to parameter server and worker nodes on GPU-enabled virtual machines in Azure Kubernetes Service, the model being served in production, and an app developer querying the model for AI features in an app.

Download a Visio file of this architecture.


Dataflow
1. A machine learning model is packaged into a container and published to Azure
Container Registry.
2. Azure Blob Storage hosts training data sets and the trained model.
3. Kubeflow is used to deploy training jobs to AKS, including parameter servers and
worker nodes.
4. Kubeflow is used to make a production model available. This step promotes a
consistent environment across testing, control, and production.
5. AKS supports GPU-enabled VMs.
6. Developers build features to query the model that runs in an AKS cluster.
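As an illustration of step 6, an application might call the REST endpoint that the serving layer exposes on AKS. The URL and payload shape below are assumptions that depend on how the model is served (for example, through KServe or a custom serving container).

Python
import requests

# Hypothetical scoring endpoint exposed by the serving layer on AKS.
SCORING_URL = "http://<aks-ingress>/v1/models/my-model:predict"

def predict(feature_rows):
    """POST a batch of feature vectors to the served model and return its response."""
    response = requests.post(
        SCORING_URL,
        json={"instances": feature_rows},  # payload shape assumed (KServe-style)
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

print(predict([[5.1, 3.5, 1.4, 0.2]]))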

Components
Blob Storage is a service that's part of Azure Storage . Blob Storage offers
optimized cloud object storage for large amounts of unstructured data.
Container Registry builds, stores, and manages container images and can store
containerized machine learning models.
AKS is a highly available, secure, and fully managed Kubernetes service. AKS
makes it easy to deploy and manage containerized applications.
Machine Learning is a cloud-based environment that you can use to train,
deploy, automate, manage, and track machine learning models. You can use the
models to forecast future behavior, outcomes, and trends.

Scenario details
AKS is useful when you need high-scale production deployments of your machine
learning models. A high-scale deployment involves a fast response time, autoscaling of
the deployed service, and logging. For more information, see Deploy a model to an
Azure Kubernetes Service cluster.

This solution uses Kubeflow to manage the deployment to AKS. The machine learning
models run on AKS clusters that are backed by GPU-enabled virtual machines (VMs).

Potential use cases


This solution applies to scenarios that use AKS and GPU-enabled VMs for machine
learning. Examples include:

Image classification systems.


Natural language processing algorithms.
Predictive maintenance systems.

Next steps
What is Azure Machine Learning?
Azure Kubernetes Service (AKS)
Deploy a model to an Azure Kubernetes Service cluster
Kubeflow on Azure
What is Azure Blob Storage?
Introduction to container registries in Azure

Related resources
Artificial intelligence (AI) - Architectural overview
Orchestrate MLOps by using Azure Databricks
Azure Databricks

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This article provides a machine learning operations (MLOps) architecture and process
that uses Azure Databricks. This process defines a standardized way to move machine
learning models and pipelines from development to production, with options to include
automated and manual processes.

Architecture

Download a Visio file of this architecture.

Workflow
This solution provides a robust MLOps process that uses Azure Databricks. All elements
in the architecture are pluggable, so you can integrate other Azure and third-party
services throughout the architecture as needed. This architecture and description are
adapted from the e-book The Big Book of MLOps . This e-book explores the
architecture described here in more detail.

Source control: This project's code repository organizes the notebooks, modules,
and pipelines. Data scientists create development branches to test updates and
new models. Code is developed in notebooks or in IDEs, backed by Git, with
Databricks Repos integration for syncing with your Azure Databricks workspaces.
Source control promotes machine learning pipelines from development, through
staging (for testing), to production (for deployment).

Lakehouse - production data: Data scientists work in the development environment, where they have read-only access to production data. (Alternatively,
data can be mirrored or redacted.) They also have read/write access to a dev
storage environment for development and experimentation. We recommend a
Lakehouse architecture for data, in which data is stored in Azure Data Lake
Storage in Delta Lake format. Access controls are defined with Microsoft Entra
credential passthrough or table access controls.

Development
In the development environment, data scientists and engineers develop machine
learning pipelines.

1. Exploratory data analysis (EDA): Data scientists explore data in an interactive, iterative process. This ad hoc work might not be deployed to staging or
production. Tools might include Databricks SQL, dbutils.data.summarize, and
AutoML.

2. Model training and other machine learning pipelines: Machine learning pipelines
are developed as modular code in notebooks and/or IDEs. For example, the model
training pipeline reads data from the Feature Store and other Lakehouse tables.
Training and tuning log model parameters and metrics to the MLflow tracking
server. The Feature Store API logs the final model. These logs link the model, its inputs, and the training code (see the sketch after this list).

3. Commit code: To promote the machine learning workflow toward production, the
data scientist commits the code for featurization, training, and other pipelines to
source control.
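A minimal sketch of the logging in step 2, using the open-source MLflow API. The experiment path, hyperparameter, and scikit-learn model are illustrative; on Azure Databricks, autologging can capture much of this automatically.

Python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("/Shared/demand-model-dev")  # hypothetical experiment path

with mlflow.start_run():
    # Log hyperparameters and metrics to the MLflow tracking server.
    C = 0.5
    mlflow.log_param("C", C)

    model = LogisticRegression(C=C, max_iter=200).fit(X_train, y_train)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))

    # Log the trained model artifact so it can later be registered and promoted.
    mlflow.sklearn.log_model(model, artifact_path="model")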
Staging
In the staging environment, CI infrastructure tests changes to machine learning pipelines
in an environment that mimics production.

4. Merge request: When a merge (or pull) request is submitted against the staging
(main) branch of the project in source control, a continuous integration and
continuous delivery (CI/CD) tool like Azure DevOps runs tests.

5. Unit and CI tests: Unit tests run in CI infrastructure, and integration tests run end-
to-end workflows on Azure Databricks. If tests pass, the code changes merge.

6. Build a release branch: When machine learning engineers are ready to deploy the
updated machine learning pipelines to production, they can build a new release. A
deployment pipeline in the CI/CD tool redeploys the updated pipelines as new
workflows.

Production

Machine learning engineers manage the production environment, where machine learning pipelines directly serve end applications. The key pipelines in production
refresh feature tables, train and deploy new models, run inference or serving, and
monitor model performance.

7. Feature table refresh: This pipeline reads data, computes features, and writes to
Feature Store tables. It runs continuously in streaming mode, runs on a schedule,
or is triggered.

8. Model training: In production, the model training or retraining pipeline is either triggered or scheduled to train a fresh model on the latest production data.
Models are registered to the MLflow Model Registry.

9. Continuous deployment: Registering new model versions triggers the CD pipeline, which runs tests to ensure that the model will perform well in production. As the
model passes tests, its progress is tracked in the Model Registry via model stage
transitions. Registry webhooks can be used for automation. Tests can include
compliance checks, A/B tests to compare the new model with the current
production model, and infrastructure tests. Test results and metrics are recorded in
Lakehouse tables. You can optionally require manual sign-offs before models are
transitioned to production.

10. Model deployment: As a model enters production, it's deployed for scoring or
serving. The most common deployment modes are:
Batch or streaming scoring: For latencies of minutes or longer, batch and
streaming are the most cost-effective options. The scoring pipeline reads the
latest data from the Feature Store, loads the latest production model version
from the Model Registry, and performs inference in a Databricks job. It can
publish predictions to Lakehouse tables, a Java Database Connectivity (JDBC)
connection, flat files, message queues, or other downstream systems (see the sketch after this list).
Online serving (REST APIs): For low-latency use cases, online serving is
generally necessary. MLflow can deploy models to MLflow Model Serving on
Azure Databricks, cloud provider serving systems, and other systems. In all
cases, the serving system is initialized with the latest production model from
the Model Registry. For each request, it fetches features from an online
Feature Store and makes predictions.

11. Monitoring: Continuous or periodic workflows monitor input data and model
predictions for drift, performance, and other metrics. Delta Live Tables can simplify
the automation of monitoring pipelines, storing the metrics in Lakehouse tables.
Databricks SQL, Power BI, and other tools can read from those tables to create
dashboards and alerts.

12. Retraining: This architecture supports both manual and automatic retraining.
Scheduled retraining jobs are the easiest way to keep models fresh.
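A minimal sketch of the batch scoring mode described earlier: load the latest Production-stage model from the MLflow Model Registry and apply it to a Spark DataFrame. The registered model name and table names are placeholders.

Python
import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the latest model version that's currently in the Production stage.
model_uri = "models:/demand-model/Production"  # hypothetical registered name
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)

# Score the latest feature data and write predictions to a Lakehouse table.
features = spark.read.table("feature_store.demand_features")  # hypothetical table
scored = features.withColumn("prediction", predict_udf(*features.columns))
scored.write.mode("overwrite").saveAsTable("predictions.demand_daily")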

Components
Data Lakehouse . A Lakehouse architecture unifies the best elements of data
lakes and data warehouses, delivering data management and performance
typically found in data warehouses with the low-cost, flexible object stores offered
by data lakes.
Delta Lake is the recommended choice for an open-source data format for a
lakehouse. Azure Databricks stores data in Data Lake Storage and provides a
high-performance query engine.
MLflow is an open-source project for managing the end-to-end machine
learning lifecycle. These are its main components:
Tracking allows you to track experiments to record and compare parameters,
metrics, and model artifacts.
Databricks Autologging extends MLflow automatic logging to track
machine learning experiments, automatically logging model parameters,
metrics, files, and lineage information.
MLFlow Model allows you to store and deploy models from any machine
learning library to various model serving and inference platforms.
Model Registry provides a centralized model store for managing model
lifecycle stage transitions from development to production.
Model Serving enables you to host MLflow models as REST endpoints.
Azure Databricks . Azure Databricks provides a managed MLflow service with
enterprise security features, high availability, and integrations with other Azure
Databricks workspace features.
Databricks Runtime for Machine Learning automates the creation of a cluster
that's optimized for machine learning, preinstalling popular machine learning
libraries like TensorFlow, PyTorch, and XGBoost in addition to Azure Databricks
for Machine Learning tools like AutoML and Feature Store clients.
Feature Store is a centralized repository of features. It enables feature sharing
and discovery, and it helps to avoid data skew between model training and
inference.
Databricks SQL. Databricks SQL provides a simple experience for SQL queries
on Lakehouse data, and for visualizations, dashboards, and alerts.
Databricks Repos provides integration with your Git provider in the Azure
Databricks workspace, simplifying collaborative development of notebooks or
code and IDE integration.
Workflows and jobs provide a way to run non-interactive code in an Azure
Databricks cluster. For machine learning, jobs provide automation for data
preparation, featurization, training, inference, and monitoring.

Alternatives
You can tailor this solution to your Azure infrastructure. Common customizations
include:

Multiple development workspaces that share a common production workspace.


Exchanging one or more architecture components for your existing infrastructure.
For example, you can use Azure Data Factory to orchestrate Databricks jobs.
Integrating with your existing CI/CD tooling via Git and Azure Databricks REST
APIs.

Scenario details
MLOps helps to reduce the risk of failures in machine learning and AI systems and to
improve the efficiency of collaboration and tooling. For an introduction to MLOps and
an overview of this architecture, see Architecting MLOps on the Lakehouse .

By using this architecture, you can:


Connect your business stakeholders with machine learning and data science
teams. This architecture allows data scientists to use notebooks and IDEs for
development. It enables business stakeholders to view metrics and dashboards in
Databricks SQL, all within the same Lakehouse architecture.
Make your machine learning infrastructure datacentric. This architecture treats
machine learning data (data from feature engineering, training, inference, and
monitoring) just like other data. It reuses tooling for production pipelines,
dashboarding, and other general data processing for machine learning data
processing.
Implement MLOps in modules and pipelines. As with any software application,
the modularized pipelines and code in this architecture enable testing of individual
components and decrease the cost of future refactoring.
Automate your MLOps processes as needed. In this architecture, you can
automate steps to improve productivity and reduce the risk of human error, but
not every step needs to be automated. Azure Databricks permits UI and manual
processes in addition to APIs for automation.

Potential use cases


This architecture applies to all types of machine learning, deep learning, and advanced
analytics. Common machine learning / AI techniques used in this architecture include:

Classical machine learning, like linear models, tree-based models, and boosting.
Modern deep learning, like TensorFlow and PyTorch.
Custom analytics, like statistics, Bayesian methods, and graph analytics.

The architecture supports both small data (single machine) and large data (distributed
computing and GPU-accelerated). In each stage of the architecture, you can choose
compute resources and libraries to adapt to your data and problem dimensions.

The architecture applies to all types of industries and business use cases. Azure
Databricks customers using this and similar architectures include small and large
organizations in industries like these:

Consumer goods and retail services


Financial services
Healthcare and life sciences
Information technology

For examples, see the Databricks website .

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Brandon Cowen | Senior Cloud Solution Architect

Other contributor:

Mick Alberts | Technical Writer

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
The Big Book of MLOps
Need for Data-centric ML Platforms (introduction to MLOps)
Databricks Machine Learning in-product quickstart
10-minute tutorials: Get started with machine learning on Azure Databricks
Databricks Machine Learning documentation
Databricks Machine Learning product page and resources
MLOps on Databricks: A How-To Guide
Automating the ML Lifecycle With Databricks Machine Learning
MLOps on Azure Databricks with MLflow
Machine Learning Engineering for the Real World
Automate Your Machine Learning Pipeline
Databricks Academy
Databricks Academy GitHub project (free training)
MLOps glossary
Three Principles for Selecting Machine Learning Platforms
What is a Lakehouse?
Delta Lake home page
Ingest data into the Azure Databricks Lakehouse
Clusters
Libraries
MLflow Documentation
Azure Databricks MLflow guide
Share models across workspaces
Notebooks
Developer tools and guidance
Deploy MLflow models to online endpoints in Azure Machine Learning
Deploy to Azure Kubernetes Service (AKS)
Related resources
MLOps framework to upscale machine learning lifecycle with Azure Machine
Learning
MLOps v2
MLOps maturity model
Deploy AI and machine learning computing on-premises and to the edge
Azure Container Registry Azure IoT Edge Azure Machine Learning Azure Stack Edge

This reference architecture illustrates how to use Azure Stack Edge to extend rapid
machine learning inference from the cloud to on-premises or edge scenarios. Azure
Stack Hub delivers Azure capabilities such as compute, storage, networking, and
hardware-accelerated machine learning to any edge location.

Architecture
Architecture diagram showing on-premises training data passing through Azure Stack Edge and Azure IoT Hub to Azure Blob Storage, Azure Machine Learning generating a model that's stored in Azure Container Registry, and the stored model and sampled data flowing back to the edge.


Download a Visio file of this architecture.

Workflow
The architecture consists of the following steps:

Azure Machine Learning. Machine Learning lets you build, train, deploy, and
manage machine learning models in a cloud-based environment. These models
can then deploy to Azure services, including (but not limited to) Azure Container
Instances, Azure Kubernetes Service (AKS), and Azure Functions.
Azure Container Registry. Container Registry is a service that creates and manages
the Docker Registry. Container Registry builds, stores, and manages Docker
container images and can store containerized machine learning models.
Azure Stack Edge. Azure Stack Edge is an edge computing device that's designed
for machine learning inference at the edge. Data is preprocessed at the edge
before transfer to Azure. Azure Stack Edge includes compute acceleration
hardware that's designed to improve performance of AI inference at the edge.
Local data. Local data references any data that's used in the training of the
machine learning model. The data can be in any local storage solution, including
Azure Arc deployments.

Components
Azure Machine Learning
Azure Container Registry
Azure Stack Edge
Azure IoT Hub
Azure Blob Storage

Scenario details

Potential use cases


This solution is ideal for the telecommunications industry. Typical uses for extending
inference include when you need to:

Run local, rapid machine learning inference against data as it's ingested and you
have a significant on-premises hardware footprint.
Create long-term research solutions where existing on-premises data is cleaned
and used to generate a model. The model is then used both on-premises and in
the cloud; it's retrained regularly as new data arrives.
Build software applications that need to make inferences about users, both at a
physical location and online.

Recommendations

Ingesting, transforming, and transferring data stored


locally
Azure Stack Edge can transform data sourced from local storage before transferring that
data to Azure. This transformation is done by an Azure IoT Edge device that's deployed
on the Azure Stack Edge device. These IoT Edge devices are associated with an Azure IoT
Hub resource on the Azure cloud platform.

Each IoT Edge module is a Docker container that does a specific task in an ingest,
transform, and transfer workflow. For example, an IoT Edge module can collect data
from an Azure Stack Edge local share and transform the data into a format that's ready
for machine learning. Then, the module transfers the transformed data to an Azure Stack
Edge cloud share. You can add custom or built-in modules to your IoT Edge device or
develop custom IoT Edge modules.

Note

IoT Edge modules are registered as Docker container images in Container Registry.

In the Azure Stack Edge resource on the Azure cloud platform, the cloud share is backed
by an Azure Blob storage account resource. All data in the cloud share will automatically
upload to the associated storage account. You can verify the data transformation and
transfer by either mounting the local or cloud share, or by traversing the Azure Storage
account.

Training and deploying a model


After preparing and storing data in Blob storage, you can create a Machine Learning
dataset that connects to Azure Storage. A dataset represents a single copy of your data
in storage that's directly referenced by Machine Learning.
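As a minimal sketch with the Azure Machine Learning Python SDK v1, the following snippet creates a tabular dataset that references the prepared data in Blob storage and registers it in the workspace. The datastore name, path, and dataset name are hypothetical.

Python
from azureml.core import Workspace, Datastore, Dataset

# Connect to the workspace (reads the config.json downloaded from the portal).
ws = Workspace.from_config()

# Hypothetical datastore and path that point at the prepared data in Blob storage.
datastore = Datastore.get(ws, "edge_training_data")
dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, "prepared/sensor-data/*.csv")
)

# Register the dataset so training scripts reference it by name and version
# instead of embedding storage paths or secrets.
dataset = dataset.register(workspace=ws, name="edge-sensor-data", create_new_version=True)
print(dataset.name, dataset.version)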

You can use the Machine Learning command-line interface (CLI), the R SDK , the
Python SDK, designer, or Visual Studio Code to build the scripts that are required to
train your model.
After training and readying the model to deploy, you can deploy it to various Azure
services, including but not limited to:

Azure Container Registry. You can deploy the models to a private Docker Registry
such as Azure Container Registry since they are Docker container images.
Azure Container Instances. You can deploy the model's Docker container image
directly to a container group.
Azure Kubernetes Service. You can use Azure Kubernetes Service to automatically
scale the model's Docker container image for high-scale production deployments.
Azure Functions. You can package a model to run directly on a Functions instance.
Azure Machine Learning. You can use Compute instances, managed cloud-based
development workstations, for both training and inference of models. You can also
similarly deploy the model to on-premises IoT Edge and Azure Stack Edge devices.

Note

For this reference architecture, the model deploys to Azure Stack Edge to make the
model available for inference on-premises. The model also deploys to Container
Registry to ensure that the model is available for inference across the widest variety
of Azure services.

Inference with a newly deployed model


Azure Stack Edge can quickly run machine learning models locally against data on-
premises by using its built-in compute acceleration hardware. This computation occurs
entirely at the edge. The result is rapid insights from data by using hardware that's
closer to the data source than a public cloud region.

Additionally, Azure Stack Edge continues to transfer data to Machine Learning for
continuous retraining and improvement by using a machine learning pipeline that's
associated with the model that's already running against data stored locally.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Availability
Consider placing your Azure Stack Edge resource in the same Azure region as
other Azure services that will access it. To optimize upload performance, consider
placing your Azure Blob storage account in the region where your appliance has
the best network connection.
Consider Azure ExpressRoute for a stable, redundant connection between your
device and Azure.

Manageability
Administrators can verify that the data source from local storage has transferred to
the Azure Stack Edge resource correctly. They can verify by mounting the Server
Message Block (SMB)/Network File System (NFS) file share or connecting to the
associated Blob storage account by using Azure Storage Explorer .
Use Machine Learning datasets to reference your data in Blob storage while
training your model. Referencing storage eliminates the need to embed secrets,
data paths, or connection strings in your training scripts.
In your Machine Learning workspace, register and track ML models to track
differences between your models at different points in time. You can similarly
mirror the versioning and tracking metadata in the tags that you use for the
Docker container images that deploy to Container Registry.

DevOps
Review the MLOps lifecycle management approach for Machine Learning. For
example, use GitHub or Azure Pipelines to create a continuous integration process
that automatically trains and retrains a model. Training can be triggered either
when new data populates the dataset or a change is made to the training scripts.
The Azure Machine Learning workspace will automatically register and manage
Docker container images for machine learning models and IoT Edge modules.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

Use the Azure pricing calculator to estimate costs.


Azure Stack Edge pricing is calculated as a flat-rate monthly subscription with a
one-time shipping fee.
Azure Machine Learning also deploys Container Registry, Azure Storage, and Azure
Key Vault services, which incur extra costs. For more information, see How Azure
Machine Learning works: Architecture and concepts.
Azure Machine Learning pricing includes charges for the virtual machines that
are used for training the model in the public cloud.

Next steps
Product documentation

What is Azure Machine Learning?


Azure Container Registry
Azure Stack Edge

Microsoft Learn modules:

Get started with AI on Azure


Work with data in Azure Machine Learning

Related resources
Build an enterprise-grade conversational bot
Image classification on Azure
Many models machine learning (ML) at scale in Azure with Spark
Azure Data Factory Azure Data Lake Azure Databricks Azure Machine Learning Azure Synapse Analytics

This article describes an architecture for many models that uses Apache Spark in either
Azure Databricks or Azure Synapse Analytics. Spark is a powerful tool for the large and
complex data transformations that some solutions require.

Note

Use Spark versions 3.0 and later for many models applications. The data
transformation capabilities and support for Python and pandas are much better
than in earlier versions.

A companion article, Many models machine learning (ML) at scale with Azure Machine
Learning, uses Machine Learning and compute clusters.

Architecture

Download a Visio file of this architecture.

Dataflow
1. Data ingestion: Azure Data Factory pulls data from a source database and copies it
to Azure Data Lake Storage.
2. Model-training pipeline:
a. Prepare data: The training pipeline pulls the data from Data Lake Storage and
uses Spark to group it into datasets for training the models.
b. Train models: The pipeline trains models for all the datasets that were created
during data preparation. It uses the pandas function API to train multiple
models in parallel. After a model is trained, the pipeline registers it into Machine
Learning along with the testing metrics.
3. Model-promotion pipeline:
a. Evaluate models: The promotion pipeline evaluates the trained models before
moving them to production. A DevOps pipeline applies business logic to
determine whether a model meets the criteria for deployment. For example, it
might check that the accuracy of the testing data is over 80 percent.
b. Register models: The promotion pipeline registers the models that qualify to
the production Machine Learning workspace.
4. Model batch-scoring pipeline:
a. Prepare data: The batch-scoring pipeline pulls data from Data Lake Storage and
uses Spark to group it into datasets for scoring.
b. Score models: The pipeline uses the pandas function API to score multiple
datasets in parallel. It finds the appropriate model for each dataset in Machine
Learning by searching the model tags. Then it downloads the model and uses it
to score the dataset. It uses the Spark connector to Synapse SQL to retain the
results.
5. Real-time scoring: Azure Kubernetes Service (AKS) can do real-time scoring if
needed. Because of the large number of models, they should be loaded on
demand, not pre-loaded.
6. Results:
a. Predictions: The batch-scoring pipeline saves predictions to Synapse SQL.
b. Metrics: Power BI connects to the model predictions to retrieve and aggregate
results for presentation.

Components
Azure Machine Learning is an enterprise-grade ML service for building and
deploying models quickly. It provides users at all skill levels with a low-code
designer, automated ML (AutoML), and a hosted Jupyter notebook environment
that supports various IDEs.
Azure Synapse Analytics is an analytics service that unifies data integration,
enterprise data warehousing, and big data analytics.
Synapse SQL is a distributed query system for T-SQL that enables data
warehousing and data virtualization scenarios and extends T-SQL to address
streaming and ML scenarios. It offers both serverless and dedicated resource
models.
Azure Data Lake Storage is a massively scalable and secure storage service for
high-performance analytics workloads.
Azure Kubernetes Service (AKS) is a fully managed Kubernetes service for
deploying and managing containerized applications. AKS simplifies deployment of
a managed AKS cluster in Azure by offloading the operational overhead to Azure.
Azure DevOps is a set of developer services that provide comprehensive
application and infrastructure lifecycle management. DevOps includes work
tracking, source control, build and CI/CD, package management, and testing
solutions.
Microsoft Power BI is a collection of software services, apps, and connectors that
work together to turn unrelated sources of data into coherent, visually immersive,
and interactive insights.

Alternatives
You can use Spark in Azure Synapse instead of Spark in Azure Databricks for model
training and scoring.
The source data can come from any database.
You can use a managed online endpoint or AKS to deploy real-time inferencing.

Scenario details
Many machine learning (ML) problems are too complex for a single ML model to solve.
Whether it's predicting sales for every item of every store, or modeling maintenance for
hundreds of oil wells, having a model for each instance might improve results on many
ML problems. This many models pattern is very common across a wide variety of
industries, and applies to many real-world use cases. With the use of Azure Machine
Learning, an end-to-end many models pipeline can include model training, batch-
inferencing deployment, and real-time deployment.

A many models solution requires a different dataset for every model during training and
scoring. For instance, if the task is to predict sales for every item of every store, every
dataset will be for a unique item-store combination.

Potential use cases


Retail: A grocery store chain needs to create a separate revenue forecast model for
each store and item, totaling over 1,000 models per store.
Supply chain: For each combination of warehouse and product, a distribution
company needs to optimize inventory.
Restaurants: A chain with thousands of franchises needs to forecast the demand
for each.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Data partitions: Partitioning the data is the key to implementing the many models
pattern. If you want one model per store, a dataset comprises all the data for one
store, and there are as many datasets as there are stores. If you want to model
products by store, there will be a dataset for every combination of product and
store. Depending on the source data format, it might be easy to partition the data,
or it can require extensive data shuffling and transformation. Spark and Synapse
SQL scale very well for such tasks, while Python pandas doesn't, since it runs only
on one node and process.
Model management: The training and scoring pipelines identify and invoke the
right model for each dataset. To do this, they calculate tags that characterize the
dataset, and then use the tags to find the matching model. The tags identify the
data partition key and the model version, and might also provide other
information.
Choosing the right architecture:
Spark is appropriate when your training pipeline has complex data
transformation and grouping requirements. It provides flexible splitting and
grouping techniques to group data by combinations of characteristics, such as
product-store or location-product. The results can be placed in a Spark
DataFrame for use in subsequent steps.
When your ML training and scoring algorithms are straightforward, you might
be able to partition data with libraries such as Scikit-learn. In such cases, you
might not need Spark, so you can avoid possible complexities that can arise
when installing Azure Synapse or Azure Databricks.
When the training datasets are already created—for example, they're in
separate files or in separate rows or columns—you don’t need Spark for
complex data transformations.
The Machine Learning and compute clusters solution provides great versatility
for situations that require complex setup. For example, you can make use of a
custom Docker container, or download files, or download pre-trained models.
Computer vision and natural language processing (NLP) deep learning are
examples of applications that might require such versatility.
Spark training and scoring: When you use the Spark architecture, you can use the Spark pandas function API for parallel training and scoring (see the sketch after this list).
Separate model repos: To protect the deployed models, consider storing them in
their own repository that the training and testing pipelines don't touch.
Online inferencing: If a pipeline loads and caches all models at the start, the
models might exhaust the container's memory. Therefore, load the models on
demand in the run method, even though it might increase latency slightly.
Training scalability: By using Spark, you can train hundreds of thousands of
models in parallel. Spark spins up multiple training processes in every VM in a
cluster. Each core can run a separate process. While this means good utilization of
resources, it's important to size the cluster accurately and choose the right SKU,
especially if the training process is expensive and long running.
Implementation details: For detailed information on implementing a many models
solution, see Implement many models for ML in Azure .
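As a minimal sketch of the pandas function API mentioned in the Spark training and scoring consideration, group the data by its partition key and train one model per group in parallel. The table, column names, and model type are illustrative assumptions.

Python
import pandas as pd
from pyspark.sql import SparkSession
from sklearn.linear_model import LinearRegression

spark = SparkSession.builder.getOrCreate()

# Hypothetical sales table with one row per store, item, and day.
df = spark.read.table("sales.daily")

def train_one_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Train a model for a single store-item partition and return its metrics."""
    X = pdf[["price", "promo_flag", "day_of_week"]]
    y = pdf["quantity"]
    model = LinearRegression().fit(X, y)
    return pd.DataFrame([{
        "store_id": pdf["store_id"].iloc[0],
        "item_id": pdf["item_id"].iloc[0],
        "r2": model.score(X, y),
    }])

results = (
    df.groupBy("store_id", "item_id")
      .applyInPandas(train_one_group, schema="store_id string, item_id string, r2 double")
)
results.write.mode("overwrite").saveAsTable("sales.training_metrics")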

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

To better understand the cost of running this scenario on Azure, use the pricing
calculator . Good starting assumptions are:

The serving models are trained daily to keep them current.


For a dataset of 40 million rows with 10 thousand combinations of store and
product, training on Azure Databricks using a cluster provisioned with 12 VMs that
use Ls16_v2 instances, takes about 30 minutes.
Batch scoring with the same set of data takes about 20 minutes.
You can use Machine Learning to deploy real-time inferencing. Depending on your
request volume, choose an appropriate VM type and cluster size.
An AKS cluster autoscales as needed, resulting in two nodes per month being
active on average.

To see how pricing differs for your use case, change the variables to match your
expected data size and serving load requirements. For larger or smaller training data
sizes, increase or decrease the size of the Azure Databricks cluster. To handle more
concurrent users during model serving, increase the AKS cluster size.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

James Nguyen | Principal Cloud Solution Architect

Next steps
What are compute targets in Azure Machine Learning?
Azure Arc-enabled Machine Learning
Many Models Solution Accelerator
ParallelRunStep Class
pandas function APIs
Connect to storage services on Azure
What is Azure Synapse Analytics?
Deploy a model to an Azure Kubernetes Service cluster

Related resources
Analytics architecture design
Choose an analytical data store in Azure
Choose a data analytics technology in Azure
Many models machine learning (ML) at scale with Azure Machine Learning
Batch scoring of Spark models on Azure Databricks
Many models machine learning (ML) at scale with Azure Machine Learning
Azure Data Factory Azure Data Lake Azure Databricks Azure Machine Learning Azure Synapse Analytics

This article describes an architecture for many models that uses Machine Learning and
compute clusters. It provides great versatility for situations that require complex setup.

A companion article, Many models machine learning (ML) at scale in Azure with Spark,
uses Apache Spark in either Azure Databricks or Azure Synapse Analytics.

Architecture

Download a Visio file of this architecture.

Workflow
1. Data ingestion: Azure Data Factory pulls data from a source database and copies it
to Azure Data Lake Storage. It then stores it in a Machine Learning datastore as a
tabular dataset.
2. Model-training pipeline:
a. Prepare data: The training pipeline pulls the data from the datastore and
transforms it further, as needed. It also groups the data into datasets for
training the models.
b. Train models: The pipeline trains models for all the datasets that were created
during data preparation. It uses the ParallelRunStep class to train multiple
models in parallel. After a model is trained, the pipeline registers it into Machine
Learning along with the testing metrics.
3. Model-promotion pipeline:
a. Evaluate models: The promotion pipeline evaluates the trained models before
moving them to production. A DevOps pipeline applies business logic to
determine whether a model meets the criteria for deployment. For example, it
might check that the accuracy of the testing data is over 80 percent.
b. Register models: The promotion pipeline registers the models that qualify to
the production Machine Learning workspace.
4. Model batch-scoring pipeline:
a. Prepare data: The batch-scoring pipeline pulls data from the datastore and
further transforms each file as needed. It also groups the data into datasets for
scoring.
b. Score models: The pipeline uses the ParallelRunStep class to score multiple
datasets in parallel. It finds the appropriate model for each dataset in Machine
Learning by searching the model tags. Then it downloads the model and uses it
to score the dataset. It uses the DataTransferStep class to write the results back
to Azure Data Lake, and then passes predictions from Azure Data Lake to
Synapse SQL for serving.
5. Real-time scoring: Azure Kubernetes Service (AKS) can do real-time scoring if
needed. Because of the large number of models, they should be loaded on
demand, not pre-loaded.
6. Results:
a. Predictions: The batch-scoring pipeline saves predictions to Synapse SQL.
b. Metrics: Power BI connects to the model predictions to retrieve and aggregate
results for presentation.

Components
Azure Machine Learning is an enterprise-grade ML service for building and
deploying models quickly. It provides users at all skill levels with a low-code
designer, automated ML (AutoML), and a hosted Jupyter notebook environment
that supports various IDEs.
Azure Databricks is a cloud-based data-engineering tool that's based on Apache
Spark. It can process and transform massive quantities of data and explore it by
using ML models. You can write jobs in R, Python, Java, Scala, and Spark SQL.
Azure Synapse Analytics is an analytics service that unifies data integration,
enterprise data warehousing, and big data analytics.
Synapse SQL is a distributed query system for T-SQL that enables data
warehousing and data virtualization scenarios and extends T-SQL to address
streaming and ML scenarios. It offers both serverless and dedicated resource
models.
Azure Data Lake Storage is a massively scalable and secure storage service for
high-performance analytics workloads.
Azure Kubernetes Service (AKS) is a fully managed Kubernetes service for
deploying and managing containerized applications. AKS simplifies deployment of
a managed AKS cluster in Azure by offloading the operational overhead to Azure.
Azure DevOps is a set of developer services that provide comprehensive
application and infrastructure lifecycle management. DevOps includes work
tracking, source control, build and CI/CD, package management, and testing
solutions.
Microsoft Power BI is a collection of software services, apps, and connectors that
work together to turn unrelated sources of data into coherent, visually immersive,
and interactive insights.

Alternatives
The source data can come from any database.
You can use a managed online endpoint or AKS to deploy real-time inferencing.

Scenario details
Many machine learning (ML) problems are too complex for a single ML model to solve.
Whether it's predicting sales for every item of every store, or modeling maintenance for
hundreds of oil wells, having a model for each instance might improve results on many
ML problems. This many models pattern is common across a wide variety of industries,
and has many real-world use cases. With the use of Azure Machine Learning, an end-to-
end many models pipeline can include model training, batch-inferencing deployment,
and real-time deployment.

A many models solution requires a different dataset for every model during training and
scoring. For instance, if the task is to predict sales for every item of every store, every
dataset will be for a unique item-store combination.

Potential use cases


Retail: A grocery store chain needs to create a separate revenue forecast model for
each store and item, totaling over 1,000 models per store.
Supply chain: For each combination of warehouse and product, a distribution
company needs to optimize inventory.
Restaurants: A chain with thousands of franchises needs to forecast the demand
for each.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Data partitions: Partitioning the data is the key to implementing the many models
pattern. If you want one model per store, a dataset comprises all the data for one
store, and there are as many datasets as there are stores. If you want to model
products by store, there will be a dataset for every combination of product and
store. Depending on the source data format, it may be easy to partition the data,
or it might require extensive data shuffling and transformation. Spark and Synapse
SQL scale very well for such tasks, while Python pandas doesn't, since it runs only
on one node and process.
Model management: The training and scoring pipelines identify and invoke the
right model for each dataset. To do this, they calculate tags that characterize the
dataset, and then use the tags to find the matching model. The tags identify the
data partition key and the model version, and might also provide other
information.
Choosing the right architecture:
Spark is appropriate when your training pipeline has complex data
transformation and grouping requirements. It provides flexible splitting and
grouping techniques to group data by combinations of characteristics, such as
product-store or location-product. The results can be placed in a Spark
DataFrame for use in subsequent steps.
When your ML training and scoring algorithms are straightforward, you might
be able to partition data with libraries such as Scikit-learn. In such cases, you
might not need Spark, so you can avoid possible complexities that can arise
when installing Azure Synapse or Azure Databricks.
When the training datasets are already created—for example, they're in
separate files or in separate rows or columns—you don’t need Spark for
complex data transformations.
The Machine Learning and compute clusters solution provides great versatility
for situations that require complex setup. For example, you can make use of a
custom Docker container, or download files, or download pre-trained models.
Computer vision and natural language processing (NLP) deep learning are
examples of applications that might require such versatility.
Spark training and scoring: When you use the Spark architecture, you can use the
Spark pandas function API for parallel training and scoring.
Separate model repos: To protect the deployed models, consider storing them in
their own repository that the training and testing pipelines don't touch.
ParallelRunStep Class: The Python ParallelRunStep class is a powerful option for training and inferencing with many models. It can partition your data in a variety of ways and then apply your ML script to the elements of each partition in parallel. As with other forms of Machine Learning training, you can specify a custom training environment with access to Python Package Index (PyPI) packages, or a more advanced custom Docker environment for configurations that require more than standard PyPI. There are many CPU and GPU options to choose from (see the sketch after this list).
Online inferencing: If a pipeline loads and caches all models at the start, the
models might exhaust the container's memory. Therefore, load the models on
demand in the run method, even though it might increase latency slightly.
Implementation details: For detailed information on implementing a many models
solution, see Implement many models for ML in Azure .
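A minimal sketch of configuring ParallelRunStep with the SDK v1. The compute target, environment, dataset, and script names are hypothetical, and the entry script is assumed to implement the init() and run(mini_batch) functions that find the right model for each partition and score it.

Python
from azureml.core import Workspace, Environment
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

ws = Workspace.from_config()
compute = ws.compute_targets["cpu-cluster"]            # hypothetical compute cluster
env = Environment.get(ws, "many-models-env")           # hypothetical environment
dataset = ws.datasets["grouped-scoring-data"]          # hypothetical dataset

output = OutputFileDatasetConfig(name="scores")

parallel_config = ParallelRunConfig(
    source_directory="scripts",
    entry_script="score.py",       # implements init() and run(mini_batch)
    mini_batch_size="1MB",         # size of each tabular mini-batch
    error_threshold=10,
    output_action="append_row",
    environment=env,
    compute_target=compute,
    node_count=4,
    process_count_per_node=2,
)

step = ParallelRunStep(
    name="many-models-batch-score",
    parallel_run_config=parallel_config,
    inputs=[dataset.as_named_input("input_data")],
    output=output,
    allow_reuse=False,
)

run = Pipeline(workspace=ws, steps=[step]).submit(experiment_name="many-models")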

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

To better understand the cost of running this scenario on Azure, use the pricing
calculator . Good starting assumptions are:

The serving models are trained daily to keep them current.


For a dataset of 40 million rows with 10 thousand combinations of store and
product, training on Azure Databricks using a cluster provisioned with 12 VMs that
use Ls16_v2 instances, takes about 30 minutes.
Batch scoring with the same set of data takes about 20 minutes.
You can use Machine Learning to deploy real-time inferencing. Depending on your
request volume, choose an appropriate VM type and cluster size.
An AKS cluster autoscales as needed, resulting in two nodes per month being
active on average.

To see how pricing differs for your use case, change the variables to match your
expected data size and serving load requirements. For larger or smaller training data
sizes, increase or decrease the size of the Azure Databricks cluster. To handle more
concurrent users during model serving, increase the AKS cluster size.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

James Nguyen | Principal Cloud Solution Architect

Next steps
Azure Arc-enabled Machine Learning
Many Models Solution Accelerator
ParallelRunStep Class
DataTransferStep Class
Connect to storage services on Azure
What is Azure Synapse Analytics?
Deploy a model to an Azure Kubernetes Service cluster

Related resources
Analytics architecture design
Choose an analytical data store in Azure
Choose a data analytics technology in Azure
Many models machine learning (ML) at scale in Azure with Spark
Azure Machine Learning
architecture
Azure Machine Learning Azure Synapse Analytics Azure Container Registry Azure Monitor Power BI

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This architecture shows you the components used to build, deploy, and manage high-
quality models with Azure Machine Learning, a service for the end-to-end ML lifecycle.

Architecture
Diagram of the architecture: Logs, files, and media (unstructured data) and business or custom app data (structured data) are ingested into Azure Data Lake Storage (1) and analyzed with Azure Synapse Analytics (2). Models are built and trained in Azure Machine Learning (3), registered (4) with Azure Container Registry, and deployed (5) to Azure Kubernetes Service, which is secured with Azure Virtual Network and Azure Load Balancer. Azure Monitor provides monitoring and logging (6), models are retrained (7) in Azure Machine Learning, and Power BI visualizes the results (8). Microsoft Entra ID and Azure Key Vault handle authentication and secrets.

Download a Visio file of this architecture.


Note

The architecture described in this article is based on Azure Machine Learning's CLI
and Python SDK v1. For more information on the new v2 SDK and CLI, see What is
CLI and SDK v2.

Dataflow
1. Bring together all your structured, unstructured, and semi-structured data (logs,
files, and media) into Azure Data Lake Storage Gen2.
2. Use Apache Spark in Azure Synapse Analytics to clean, transform, and analyze
datasets.
3. Build and train machine learning models in Azure Machine Learning.
4. Control access and authentication for data and the ML workspace with Microsoft
Entra ID and Azure Key Vault. Manage containers with Azure Container Registry.
5. Deploy the machine learning model to a container using Azure Kubernetes
Services, securing and managing the deployment with Azure VNets and Azure
Load Balancer.
6. Using log metrics and monitoring from Azure Monitor, evaluate model
performance.
7. Retrain models as necessary in Azure Machine Learning.
8. Visualize data outputs with Power BI.
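
The note above states that this architecture is based on the Azure Machine Learning CLI and Python SDK v1. As a minimal sketch under that assumption, the following code registers a trained model and deploys it to an AKS cluster that's attached to the workspace (roughly steps 3 through 5). The workspace config, model path, environment file, entry script, and cluster name are placeholders.

```python
from azureml.core import Environment, Workspace
from azureml.core.compute import AksCompute
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AksWebservice

ws = Workspace.from_config()  # reads a config.json downloaded from the workspace

# Register a trained model artifact (path and names are placeholders).
model = Model.register(workspace=ws, model_path="outputs/model.pkl", model_name="demand-model")

# Deploy the model as a web service on an AKS cluster attached to the workspace.
env = Environment.from_conda_specification("inference-env", "conda.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=env)
aks_target = AksCompute(ws, "aks-inference")
deployment_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=2)

service = Model.deploy(
    workspace=ws,
    name="demand-model-service",
    models=[model],
    inference_config=inference_config,
    deployment_config=deployment_config,
    deployment_target=aks_target,
)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)  # endpoint that downstream apps call for predictions
```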

Components
Azure Machine Learning is an enterprise-grade machine learning (ML) service for
the end-to-end ML lifecycle.
Azure Synapse Analytics is a unified service where you can ingest, explore,
prepare, transform, manage, and serve data for immediate BI and machine learning
needs.
Azure Data Lake Storage Gen2 is a massively scalable and secure data lake for
your high-performance analytics workloads.
Azure Container Registry is a registry of Docker and Open Container Initiative
(OCI) images, with support for all OCI artifacts. Build, store, secure, scan, replicate,
and manage container images and artifacts with a fully managed, geo-replicated
instance of OCI distribution.
Azure Kubernetes Service (AKS) offers serverless
Kubernetes, an integrated continuous integration and continuous delivery (CI/CD)
experience, and enterprise-grade security and governance. Deploy and manage
containerized applications more easily with a fully managed Kubernetes service.
Azure Monitor lets you collect, analyze, and act on telemetry data from your
Azure and on-premises environments. Azure Monitor helps you maximize
performance and availability of your applications and proactively identify problems
in seconds.
Azure Key Vault safeguards cryptographic keys and other secrets used by cloud
apps and services.
Azure Load Balancer load-balances internet and private network traffic with high
performance and low latency. Load Balancer works across virtual machines, virtual
machine scale sets, and IP addresses.
Power BI is a suite of business analytics tools that deliver insights throughout
your organization. Connect to hundreds of data sources, simplify data prep, and
drive unplanned analysis. Produce beautiful reports, then publish them for your
organization to consume on the web and across mobile devices.

Scenario details
Build, deploy, and manage high-quality models with Azure Machine Learning, a service
for the end-to-end ML lifecycle. Use industry-leading MLOps (machine learning
operations), open-source interoperability, and integrated tools on a secure, trusted
platform designed for responsible machine learning (ML).

Potential use cases


Use machine learning as a service.
Easy and flexible building interface.
Wide range of supported algorithms.
Easy implementation of web services.
Great documentation for machine learning solutions.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

Get pricing estimates

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Sheri Gilley | Senior Content Developer


Larry Franks | Content Developer
Lauryn Gayhardt | Content Developer
Samantha Salgado | Content Developer

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
See documentation for the key services in this solution:

Azure Machine Learning documentation


Azure Synapse Analytics documentation
Azure Data Lake Storage Gen2 documentation
Azure Container Registry documentation
Azure Kubernetes Service documentation
Azure Monitor documentation
Azure Key Vault documentation
Azure Load Balancer documentation
Power BI documentation

Related resources
See related guidance on the Azure Architecture Center:

Compare the machine learning products and technologies from Microsoft


Machine learning operations (MLOps) framework to upscale machine learning
lifecycle with Azure Machine Learning
Deploy AI and ML computing on-premises and to the edge
Many models machine learning (ML) at scale with Azure Machine Learning
Scale AI and machine learning initiatives in regulated industries
Predict hospital readmissions by using traditional and automated machine learning
techniques
Secure research environment for regulated data
Machine teaching with Project
Bonsai autonomous systems
Azure Container Instances Azure Container Registry Azure Storage

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

Learn how Project Bonsai builds and deploys autonomous systems using machine
teaching, deep reinforcement learning (DRL), and simulations.

Architecture
Project Bonsai speeds the creation of AI-powered automation. Development and
deployment have three phases: Build, Train, and Deploy.

Download a Visio file of this architecture.

Dataflow
1. The Build phase consists of writing the machine teaching program and connecting
to a domain-specific training simulator. Simulators generate sufficient training data
for experiments and machine practice.

Subject matter experts with no AI background can express their expertise as steps,
tasks, criteria, and desired outcomes. Engineers build autonomous systems by
creating accurate, detailed models of systems and environments, and making the
systems intelligent using methods like deep learning, imitation learning, and
reinforcement learning.

2. In the Train phase, the training engine automates DRL model generation and
training by combining high-level domain models with appropriate DRL algorithms
and neural networks.

Simulations train the models across different kinds of environmental conditions
and scenarios much faster and more safely than is feasible in the real world. Experts can
supervise the agents as they work to solve problems in simulated environments,
and provide feedback and guidance that lets the agents dynamically adapt within
the simulation.

3. The Deploy phase deploys the trained brain to the target application in the cloud,
on-premises, or embedded on site. Specific SDKs and deployment APIs deploy
trained AI systems to various target applications, perform machine tuning, and
control the physical systems.

After training is complete, engineers deploy these trained agents to the real world,
where they use their knowledge to power autonomous systems.

Components
Project Bonsai simplifies machine teaching with DRL to train and deploy smart
autonomous systems.

Azure Container Registry is a managed, private Docker registry service that's
used to store and manage container images and artifacts for all types of container
deployments. Images are securely stored, and can be replicated to other regions to
speed up deployment. You can build on demand or automate builds with triggers,
such as source code commits and base image updates. Container Registry is based
on the open-source Docker Registry 2.0.

This architecture uses the basic tier of Container Registry to store exported brains
and uploaded simulators.
Azure Container Instances runs containers on-demand in a serverless Microsoft
Azure environment. Container Instances is the fastest and simplest way to run a
container in Azure, and doesn't require you to provision virtual machines or adopt
a higher-level service.

This architecture uses Container Instances to run simulations.

Azure Storage is a cloud storage solution that includes object, blob, file, disk,
queue, and table storage.

This architecture uses Storage for storing uploaded simulators as ZIP files.

Scenario details
Artificial intelligence (AI) and machine learning offer unique opportunities and
challenges for automating complex industrial systems. Machine teaching is a new
paradigm for building machine learning systems that moves the focus away from
algorithms and towards successful model generation and deployment.

Machine teaching infuses subject matter expertise into automated AI system training
with deep reinforcement learning (DRL) and simulations. Abstracting away AI complexity
to focus on subject matter expertise and real-world conditions creates models that turn
automated control systems into autonomous systems.

Autonomous systems are automated control systems that:

Use machine teaching to combine human domain knowledge with AI and machine
learning.
Automate the generation and management of DRL algorithms and models.
Integrate simulations for model optimization and scalability during training.
Deploy and scale for real-world use.

Potential use cases


This solution is ideal for the education, facilities, real-estate, manufacturing, government,
automotive, and media and entertainment industries. Project Bonsai speeds the creation
of AI-powered automation to improve product quality and efficiency while reducing
downtime. It's now available in preview, and you can use it to automate systems.
Consider Bonsai when you face issues such as:

Existing control systems are fragile when deployed.


Machine learning logic doesn't adequately cover all scenarios.
Describing the desired system behavior requires subject matter experts who
understand the problem domain.
Generating sufficient real-world data to cover all scenarios is difficult or impossible.
Traditional control systems are difficult to deploy and scale to the real world.

Machine teaching bridges AI science and software with traditional engineering and
domain expertise. Example applications include:

Motion control
Machine calibration
Smart buildings
Industrial robotics
Process control

Deploy this scenario


The following implementations are example deployments. You can follow the resources
to understand how these solutions were designed. Use Project Bonsai to build and
deploy your own solution.

Machine teaching service


You can use Bonsai to:

Teach adaptive brains with intuitive goals and learning objectives, real-time success
assessments, and automatic versioning control.
Integrate training simulations that implement real-world problems and provide
realistic feedback.
Export trained brains and deploy them on-premises, in the cloud, or to IoT Edge
devices or embedded devices.

Here's the Bonsai user interface:


In Bonsai, managed Azure graphics processing unit (GPU) clusters run AI training on
complex neural networks at scale, with built-in support for retraining and analyzing AI
system versions. The deployment and runtime frameworks package and deploy the
resulting AI system models at scale.

The Bonsai platform runs on Azure and charges resource costs to your Azure
subscription.

Azure Container Registry (basic tier) for storing exported brains and uploaded
simulators.
Azure Container Instances for running simulations.
Azure Storage for storing uploaded simulators as ZIP files.

Inkling

Inkling is a declarative, statically typed programming language for training AI in Bonsai.


Inkling abstracts away the dynamic AI algorithms that require expertise in machine
learning, enabling more developers to program AI. An Inkling file defines the concepts
necessary to teach the AI and the curriculum, or methods for teaching those concepts.
For more information about Inkling, see the Inkling programming language reference.

Training engine
The training engine in Bonsai compiles machine teaching programs to automatically
generate and train AI systems. It does the following:

Automates model generation, management, and tuning.


Defines the neural network architecture. It specifies characteristics such as number
of layers and topology, selects the best DRL algorithm, and tunes the
hyperparameters of the model.
Connects to the simulator and orchestrates the training.

Just as a language compiler hides the machine code from the programmer, the training
engine hides the details of the machine learning models and DRL algorithms. As new
algorithms and network topologies are invented, the training engine can recompile the
same machine teaching programs to exploit them.

Cartpole sample
Bonsai includes two machine teaching samples, Cartpole and Moab .

The Cartpole sample has a pole attached to a cart by an unactuated joint. The cart
moves along a straight frictionless track and the pole moves forward and backward,
depending on the movements of the cart. The available sensor information includes the
cart position and velocity and pole angle and angular velocity. The supported agent
actions are to push the cart to the left or the right.

The pole starts upright, and the goal is to keep it upright as the cart moves. There is a
reward generated for every time interval that the pole remains upright. A training
episode ends when the pole is more than 15 degrees from vertical, or when the cart
moves more than a predefined number of units from the center of the track.
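
A minimal Python sketch of the reward and termination rules just described; this is only an illustration of the logic, not the provided Cartpole simulator, and the track limit value is an assumption.

```python
import math

# 15 degrees from vertical comes from the sample description;
# the track limit of 2.4 units is an assumed value.
MAX_POLE_ANGLE = math.radians(15)
MAX_CART_OFFSET = 2.4

def step_outcome(cart_position: float, pole_angle: float) -> tuple[float, bool]:
    """Return (reward, episode_done) for one time interval."""
    done = abs(pole_angle) > MAX_POLE_ANGLE or abs(cart_position) > MAX_CART_OFFSET
    reward = 0.0 if done else 1.0  # reward accrues while the pole stays upright
    return reward, done
```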

The sample uses Inkling language to write the machine teaching program, and the
provided Cartpole simulator to speed and improve the training.

The following Bonsai screenshot shows Cartpole training progress, with Goal
satisfaction on the y-axis and Training iterations on the x-axis. The dashboard also
shows the percentage of goal satisfaction and the total elapsed training time.

For more information about the Cartpole example, or to try it yourself, see:

Quickstart: Balance a pole with AI (Cartpole)


Learn how you can teach an AI agent to balance a pole
Simulators
Simulations model a system in a virtual representation of its intended physical
environment. Simulations are an alternative approach to creating learning policies by
hand or collecting large amounts of real-world training data. Simulations allow training
in hazardous environments, or in conditions difficult to reproduce in the real world.

Simulations are the ideal training source for DRL because they:

Can flexibly create custom environments.


Are safe and cost-effective for data generation.
Can run concurrently on multiple training machines to speed up training.

Simulations are available across a broad range of industries and systems such as
mechanical and electrical engineering, autonomous vehicles, security and networking,
transportation and logistics, and robotics.

Simulation tools include:

Simulink , a graphical programming tool developed by MathWorks to model,
simulate, and analyze dynamic systems.
Gazebo , which simulates populations of robots in complex indoor and outdoor
environments.
Microsoft AirSim , an open-source robotics simulation platform.

The Bonsai platform includes Simulink and AnyLogic simulators. You can add others.

AirSim
Microsoft AirSim (Aerial Informatics and Robotics Simulation) is an open-source
robotics simulation platform designed to train autonomous systems. AirSim provides a
realistic simulation tool for designers and developers to generate the large amounts of
data they need for model training and debugging.

AirSim can capture data from ground vehicles, wheeled robotics, aerial drones, and even
static IoT devices, and do it without costly field operations.
AirSim works as a plug-in to the Unreal Engine editor from Epic Games, providing
control over building environments and simulating difficult-to-reproduce, real-world
events to capture meaningful data. AirSim leverages current game engine rendering,
physics, and perception computation to create an accurate, real-world simulation.

This realism, based on efficiently generated ground-truth data, enables the study and
execution of complex missions that are time-consuming or risky in the real world. For
example, AirSim provides realistic environments, vehicle dynamics, and multi-modal
sensing for researchers building autonomous vehicles. Collisions in a simulator cost
virtually nothing, yet provide actionable information to improve the design of the
system.

You can use an Azure Resource Manager (ARM) template to automatically create a
development environment, and code and debug a Python application connected to
AirSim in Visual Studio Code. For more information, see AirSim Development
Environment on Azure .

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Jose Contreras | Principal Software Engineering Manager

Next steps
Autonomous systems with Microsoft AI
Autonomy for industrial control systems
Innovation space: Autonomous systems (Video)
The Microsoft AI Blog
Microsoft Autonomous Systems
Bonsai documentation
Aerial Informatics and Robotics Platform (AirSim)
How Azure Machine Learning works: Architecture and concepts

Related resources
Use subject matter expertise in machine teaching and reinforcement learning
Building blocks for autonomous-driving simulation environments
Compare the machine learning products and technologies from Microsoft
Data science and machine
learning with Azure Databricks
Azure Databricks Azure Data Lake Storage Azure Kubernetes Service (AKS) Azure Machine Learning

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This architecture shows how you can improve operations by using Azure Databricks,
Delta Lake, and MLflow for data science and machine learning. You can improve your
overall efficiency and the customer experience by developing, training, and deploying
machine learning models.

Architecture
Diagram of the architecture: Azure Data Lake Storage stores the data in Bronze, Silver, and Gold Delta Lake tables (1, 2, 3). Azure Databricks processes the data and trains models, which are then served (4) through Azure Machine Learning web services and Azure Kubernetes Service (AKS).

Download a Visio file of this architecture.

The solution stores, processes, and serves data:

Dataflow

Store

Data Lake Storage stores the data in Delta Lake format. Delta Lake forms the curated
layer of the data lake. A medallion architecture organizes the data into three layers:

Bronze tables hold raw data.


Silver tables contain cleaned, filtered data.
Gold tables store aggregated data that's ready for analytics and reporting.
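
A minimal PySpark sketch of this medallion flow, assuming a Databricks notebook where spark is predefined; the Delta table paths and column names are illustrative only.

```python
from pyspark.sql import functions as F

# Bronze: raw ingested records (path is illustrative).
bronze = spark.read.format("delta").load("/lake/bronze/sales")

# Silver: cleaned, filtered records.
silver = (bronze
          .dropDuplicates(["order_id"])
          .filter(F.col("quantity") > 0))
silver.write.format("delta").mode("overwrite").save("/lake/silver/sales")

# Gold: aggregates that are ready for analytics and reporting.
gold = (silver
        .groupBy("store_id", "product_id")
        .agg(F.sum("quantity").alias("units_sold")))
gold.write.format("delta").mode("overwrite").save("/lake/gold/sales_summary")
```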

Process
Code from various languages, frameworks, and libraries prepares, refines, and
cleanses the raw data (1). Coding possibilities include Python, R, SQL, Spark,
Pandas, and Koalas.

Azure Databricks runs data science workloads. This platform also builds and trains
machine learning models (2). Azure Databricks uses pre-installed, optimized
libraries. Examples include scikit-learn, TensorFlow, PyTorch, and XGBoost.

MLflow tracking captures the machine learning experiments, model runs, and
results (3). When the best model is ready for production, Azure Databricks deploys
that model to the MLflow model repository. This centralized registry stores
information on production models. The registry also makes models available to
other components:
Spark and Python pipelines can ingest models. These pipelines handle batch
workloads or streaming ETL processes.
REST APIs provide access to models for many purposes. Examples include
testing and interactive scoring in mobile and web applications.
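
As a minimal sketch of the MLflow tracking and registry flow described above: the model, metric, and registry name are illustrative, and the example assumes MLflow and scikit-learn are installed (on Azure Databricks, MLflow tracking is preconfigured).

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run() as run:
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    mlflow.log_param("n_estimators", 100)      # experiment parameters
    mlflow.log_metric("rmse", rmse)            # evaluation results
    mlflow.sklearn.log_model(model, "model")   # model artifact for this run

# Promote the run's model to the central MLflow Model Registry so that batch,
# streaming, and REST consumers can load it by name.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "forecast-model")
```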

Serve
Azure Databricks can deploy models to other services, such as Machine Learning and
AKS (4).

Components
Azure Databricks is a data analytics platform. Its fully managed Spark clusters
run data science workloads. Azure Databricks also uses pre-installed, optimized
libraries to build and train machine learning models. MLflow integration with Azure
Databricks provides a way to track experiments, store models in repositories, and
make models available to other services. Azure Databricks offers scalability:
Single-node compute clusters handle small data sets and single-model runs.
For large data sets, multi-node compute clusters or graphics processing unit
(GPU) clusters are available. These clusters use libraries and frameworks like
HorovodRunner and Hyperopt for parallel-model runs.

Data Lake Storage is a scalable and secure data lake for high-performance
analytics workloads. This service manages multiple petabytes of information while
sustaining hundreds of gigabits of throughput. The data can have these
characteristics:
Be structured, semi-structured, or unstructured.
Come from multiple, heterogeneous sources like logs, files, and media.
Be static, from batches, or streaming.

Delta Lake is a storage layer that uses an open file format. This layer runs on top
of cloud storage such as Data Lake Storage. Delta Lake is optimized for
transforming and cleansing batch and streaming data. This platform supports
these features and functionality:
Data versioning and rollback.
Atomicity, consistency, isolation, and durability (ACID) transactions for reliability.
A consistent standard for data preparation, model training, and model serving.
Time travel for consistent snapshots of source data. Data scientists can train
models on the snapshots instead of creating separate copies.

MLflow is an open-source platform for the machine learning life cycle. MLflow
components monitor machine learning models during training and running. Stored
information includes code, data, configuration information, and results. MLflow
also stores models and loads them in production. Because MLflow uses open
frameworks, various services, applications, frameworks, and tools can consume the
models.

Machine Learning is a cloud-based environment that helps you build, deploy,
and manage predictive analytics solutions. With these models, you can forecast
behavior, outcomes, and trends.

AKS is a highly available, secure, and fully managed Kubernetes service. AKS
makes it easy to deploy and manage containerized applications.

Scenario details
As your organization recognizes the power of data science and machine learning, you
can improve efficiency, enhance customer experiences, and predict changes. To achieve
these goals in business-critical use cases, you need a consistent and reliable pattern for:

Tracking experiments.
Reproducing results.
Deploying machine learning models into production.

This article outlines a solution for a consistent, reliable machine learning framework.
Azure Databricks forms the core of the architecture. The storage layer Delta Lake and
the machine learning platform MLflow also play significant roles. These components
integrate seamlessly with other services such as Azure Data Lake Storage, Azure
Machine Learning, and Azure Kubernetes Service (AKS).

Together, these services provide a solution for data science and machine learning that's:
Simple: An open data lake simplifies the architecture. The data lake contains a
curated layer, Delta Lake. That layer provides access to the data in an open-source
format.

Open: The solution supports open-source code, open standards, and open
frameworks. This approach minimizes the need for future updates. Azure
Databricks and Machine Learning natively support MLflow and Delta Lake.
Together, these components provide industry-leading machine learning operations
(MLOps), or DevOps for machine learning. A broad range of deployment tools
integrate with the solution's standardized model format.

Collaborative: Data science and MLOps teams work together with this solution.
These teams use MLflow tracking to record and query experiments. The teams also
deploy models to the central MLflow model registry. Data engineers then use
deployed models in data ingestion, extract-transform-load (ETL) processes, and
streaming pipelines.

Potential use cases


A platform that AGL built for energy forecasting inspired this solution. That platform
provides quick and cost-effective training, deployment, and life-cycle management for
thousands of parallel models.

Besides energy providers, this solution can benefit any organization that:

Uses data science.


Builds and trains machine learning models.
Runs machine learning models in production.

Examples include organizations in:

Retail and e-commerce.


Banking and finance.
Healthcare and life sciences.
Automotive industries and manufacturing.

Next steps
AGL Energy builds a standardized platform for thousands of parallel models. The
platform provides quick and cost-effective training, deployment, and life-cycle
management for the models.
Open Grid Europe (OGE) uses artificial intelligence models to monitor gas
pipelines. OGE uses Azure Databricks and MLflow to develop the models.
Scandinavian Airlines (SAS) uses Azure Databricks during a collaborative
research phase. The airline also uses Machine Learning to develop predictive
models. By identifying patterns in the company's data, the models improve
everyday operations.

Related resources
Choose an analytical data store in Azure
Batch scoring of Spark models on Azure Databricks
Stream processing with Azure Databricks
Ingestion, ETL, and stream processing pipelines with Azure Databricks
Modern analytics architecture with Azure Databricks
Automate document
identification, classification, and
search by using Durable Functions
Azure Functions Azure App Service Azure AI services Azure Cognitive Search

Azure Kubernetes Service (AKS)

This article describes an architecture for processing document files that contain multiple
documents of various types. It uses the Durable Functions extension of Azure Functions
to implement the pipelines that process the files.

Architecture

Download a Visio file of this architecture.

Workflow
1. The user provides a document file that the web app uploads. The file contains
multiple documents of various types. It can, for instance, be a PDF or multipage
TIFF file.
a. The document file is stored in Azure Blob Storage.
b. The web app adds a command message to a storage queue to initiate pipeline
processing.

2. Durable Functions orchestration is triggered by the command message. The
message contains metadata that identifies the location in Blob Storage of the
document file to be processed. Each Durable Functions instance processes only
one document file.

3. The Scan activity function calls the Computer Vision Read API, passing in the
location in storage of the document to be processed. Optical character recognition
(OCR) results are returned to the orchestration to be used by subsequent activities.

4. The Classify activity function calls the document classifier service that's hosted in
an Azure Kubernetes Service (AKS) cluster. This service uses regular expression
pattern matching to identify the starting page of each known document and to
calculate how many document types are contained in the document file. The types
and page ranges of the documents are calculated and returned to the
orchestration.

Note

Azure doesn’t offer a service that can classify multiple document types in a
single file. This solution uses a non-Azure service that's hosted in AKS.

5. The Metadata Store activity function saves the document type and page range
information in an Azure Cosmos DB store.

6. The Indexing activity function creates a new search document in the Cognitive
Search service for each identified document type and uses the Azure Cognitive
Search libraries for .NET to include in the search document the full OCR results and
document information. A correlation ID is also added to the search document so
that the search results can be matched with the corresponding document
metadata from Azure Cosmos DB.

7. End users can search for documents by contents and metadata. Correlation IDs in
the search result set can be used to look up document records that are in Azure
Cosmos DB. The records include links to the original document file in Blob Storage.
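
The following minimal sketch shows what the orchestrator might look like with the Python Durable Functions library. The activity names mirror the activities described above, but the exact function names, inputs, and outputs are assumptions, not the reference implementation.

```python
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    # The input carries the Blob Storage location of the uploaded document file.
    blob_location = context.get_input()

    # OCR the file (wraps the Computer Vision Read API).
    ocr_results = yield context.call_activity("Scan", blob_location)

    # Identify document types and page ranges.
    classification = yield context.call_activity("Classify", ocr_results)

    # Persist document type and page-range metadata to Azure Cosmos DB.
    yield context.call_activity("MetadataStore", classification)

    # Index the OCR text and document information in the search service.
    yield context.call_activity("Indexing", {
        "ocr": ocr_results,
        "classification": classification,
    })

    return "completed"

main = df.Orchestrator.create(orchestrator_function)
```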

Components
Durable Functions is an extension of Azure Functions that makes it possible for
you to write stateful functions in a serverless compute environment. In this
application, it's used for managing document ingestion and workflow
orchestration. It lets you define stateful workflows by writing orchestrator functions
that adhere to the Azure Functions programming model. Behind the scenes, the
extension manages state, checkpoints, and restarts, leaving you free to focus on
the business logic.
Azure Cosmos DB is a globally distributed, multi-model database that makes it
possible for your solutions to scale throughput and storage capacity across any
number of geographic regions. Comprehensive service level agreements (SLAs)
guarantee throughput, latency, availability, and consistency.
Azure Storage is a set of massively scalable and secure cloud services for data,
apps, and workloads. It includes Blob Storage , Azure Files , Azure Table
Storage , and Azure Queue Storage .
Azure App Service provides a framework for building, deploying, and scaling
web apps. The Web Apps feature is an HTTP-based service for hosting web
applications, REST APIs, and mobile back ends. With Web Apps, you can develop in
.NET, .NET Core, Java, Ruby, Node.js, PHP, or Python. Applications easily run and
scale in Windows and Linux-based environments.
Azure Cognitive Services provides intelligent algorithms to see, hear, speak,
understand, and interpret your user needs by using natural methods of
communication.
Azure Cognitive Search provides a rich search experience over private,
heterogeneous content in web, mobile, and enterprise applications.
AKS is a highly available, secure, and fully managed Kubernetes service. AKS
makes it easy to deploy and manage containerized applications.

Alternatives
The Form Recognizer read (OCR) model is an alternative to Computer Vision Read.
This solution stores metadata in Azure Cosmos DB to facilitate global distribution.
Azure SQL Database is another option for persistent storage of document
metadata and information.
You can use other messaging platforms, including Azure Service Bus , to trigger
Durable Functions instances.
For a solution accelerator that helps in clustering and segregating data into
templates, see Azure/form-recognizer-accelerator (github.com) .

Scenario details
This article describes an architecture that uses Durable Functions to implement
automated pipelines for processing document files that contain multiple documents of
various types. The pipelines identify the documents in a document file, classify them by
type, and store information that can be used in subsequent processing.

Many companies need to manage and process document files that contain documents
that have been scanned in bulk and that can contain several different document types.
Typically the document files are PDFs or multi-page TIFF images. These files usually
originate from outside the organization, and the receiving company doesn't control the
content.

Given these constraints, organizations have been forced to build their own document
parsing solutions that can include custom technology and manual processes. A solution
can include human intervention for splitting out individual document types into their
own files and adding classification qualifiers for each document.

Many of these custom solutions are based on the state machine workflow pattern and
use database systems for persisting workflow state, with polling services that check for
the states that they're responsible for processing. Maintaining and enhancing such
solutions can be difficult and time consuming.

Organizations are looking for reliable, scalable, and resilient solutions for processing and
managing document identification and classification for the types of files their
organization uses. This includes processing millions of documents per day with full
observability into the success or failure of the processing pipeline.

Potential use cases


This solution applies to many areas:

Title reporting. Many government agencies and municipalities manage paper
records that haven't been migrated to digital form. An effective automated
solution can generate a file that contains all the documents that are required to
satisfy a document request.
Maintenance records. Aircraft, locomotive, and machinery maintenance records
still exist in paper form that require scanning and sending to outside organizations.
Permit processing. City and county permitting departments still maintain paper
documents that are generated for permit inspection reporting. The ability to take a
picture of several inspection documents and automatically identify, classify, and
search across these records can be highly beneficial.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Reliability
Reliability ensures that your application can meet the commitments that you make to
your customers. For more information, see Overview of the reliability pillar.

A reliable workload is one that's both resilient and available. Resiliency is the ability of
the system to recover from failures and continue to function. The goal of resiliency is to
return the application to a fully functioning state after a failure occurs. Availability is a
measure of whether your users can access your workload when they need to.

For reliability information about solution components, see the following resources:

SLA for Azure Cognitive Search


SLA for Azure Applied AI Services
SLA for Azure Functions
SLA for App Service
SLA for Storage Accounts
SLA for Azure Kubernetes Service (AKS)
SLA for Azure Cosmos DB

Cost optimization
Cost optimization is about reducing unnecessary expenses and improving operational
efficiencies. For more information, see Overview of the cost optimization pillar.

The most significant costs for this architecture will potentially come from the storage of
image files in the storage account, Cognitive Services image processing, and index
capacity requirements in the Azure Cognitive Search service.

You can optimize costs by right-sizing the storage account with reserved capacity and
lifecycle policies, by planning Azure Cognitive Search capacity for regional deployments
and scheduling operational scale-up appropriately, and by using the commitment tier
pricing that's available for the Computer Vision OCR service to keep costs predictable.

Here are some guidelines for optimizing costs:

Use the pay-as-you-go strategy for your architecture and scale out as needed
rather than investing in large-scale resources at the start.
Consider opportunity costs in your architecture, and the balance between first-
mover advantage versus fast follow. Use the pricing calculator to estimate the
initial cost and operational costs.
Establish policies, budgets, and controls that set cost limits for your solution.

Performance efficiency
Performance efficiency is the ability of your workload to scale in an efficient manner to
meet the demands that users place on it. For more information, see Performance
efficiency pillar overview.

Periods when this solution processes high volumes can expose performance bottlenecks.
Make sure that you understand and plan for the scaling options for Azure Functions,
Cognitive Services autoscaling, and Azure Cosmos DB partitioning to ensure proper
performance efficiency for your solution.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Kevin Kraus | Principal Cloud Solution Architect


Andrea Martini | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Introductory articles:

Introduction to Azure Storage


What are Durable Functions?
What are Azure Cognitive Services?
What’s Azure Cognitive Search?
App Service overview
Introduction to Azure Cosmos DB
Azure Kubernetes Service
What is Azure Service Bus?

Product documentation:
Azure documentation (all products)
Durable Functions documentation
Azure Cognitive Services documentation
Azure Cognitive Search documentation

Related resources
Custom document processing models on Azure
Automate document processing by using Azure Form Recognizer
Image classification on Azure
Automate document processing
by using Azure Form Recognizer
Azure Cognitive Search Azure AI services Azure Cosmos DB Azure Document Intelligence

Azure Machine Learning

This article outlines a scalable and secure solution for building an automated document
processing pipeline. The solution uses Azure Form Recognizer for the structured
extraction of data. Natural language processing (NLP) models and custom models enrich
the data.

Architecture
Diagram of the architecture: Documents arrive through a browser at the front end of a web application, protected by Azure Application Gateway and Azure Web Application Firewall, or from other sources such as email attachments and FTP servers that land in Azure Blob Storage and trigger Azure Functions. Azure Form Recognizer extracts the data, Azure Cognitive Service for Language and Azure Machine Learning (real-time scoring on Azure Kubernetes Service or batch scoring) enrich it, Azure Cosmos DB stores it, and Azure Cognitive Search and Power BI provide analytics and visualizations.

Download a Visio file of this architecture.

Dataflow
The following sections describe the various stages of the data extraction process.

Data ingestion and extraction


1. Documents are ingested through a browser at the front end of a web application.
The documents contain images or are in PDF format. Azure App Service hosts a
back-end application. The solution routes the documents to that application
through Azure Application Gateway. This load balancer runs with Azure Web
Application Firewall, which helps to protect the application from common attacks
and vulnerabilities.

2. The back-end application posts a request to a Form Recognizer REST API endpoint
that uses one of these models:
Layout
Invoice
Receipt
ID document
Business card
General document, which is in preview

The response from Form Recognizer contains raw OCR data and structured
extractions. Form Recognizer also assigns confidence values to the extracted data.

3. The App Service back-end application uses the confidence values to check the
extraction quality. If the quality is below a specified threshold, the app flags the
data for manual verification. When the extraction quality meets requirements, the
data enters Azure Cosmos DB for downstream application consumption. The app
can also return the results to the front-end browser.

4. Other sources provide images, PDF files, and other documents. Sources include
email attachments and File Transfer Protocol (FTP) servers. Tools like Azure Data
Factory and AzCopy transfer these files to Azure Blob Storage. Azure Logic Apps
offers pipelines for automatically extracting attachments from emails.

5. When a document enters Blob Storage, an Azure function is triggered. The function:

Posts a request to the relevant Form Recognizer pre-built endpoint.


Receives the response.
Evaluates the extraction quality.

6. The extracted data enters Azure Cosmos DB.
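
As a minimal sketch of the request and quality check described in steps 2 and 3, using the azure-ai-formrecognizer Python SDK (v3.2 or later); the endpoint, key, document URL, model choice, and confidence threshold are placeholders.

```python
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-form-recognizer>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<key>"),
)

# Analyze a document with one of the pre-built models (invoice shown here).
poller = client.begin_analyze_document_from_url("prebuilt-invoice", "<document-url>")
result = poller.result()

CONFIDENCE_THRESHOLD = 0.8  # example quality gate

for document in result.documents:
    for name, field in document.fields.items():
        if field.confidence is not None and field.confidence < CONFIDENCE_THRESHOLD:
            # Flag low-confidence extractions for manual verification.
            print(f"Review field '{name}': confidence {field.confidence:.2f}")
```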

Data enrichment

The pipeline that's used for data enrichment depends on the use case.

1. Data enrichment can include the following NLP capabilities:

Named entity recognition (NER)


The extraction of personal information, key phrases, health information, and
other domain-dependent entities

To enrich the data, the web app:

Retrieves the extracted data from Azure Cosmos DB.


Posts requests to these features of the Azure Cognitive Service for Language
API:
NER
Personal information
Key phrase extraction
Text analytics for health
Custom NER, which is in preview
Sentiment analysis
Opinion mining

Receives responses from the Azure Cognitive Service for Language API.

2. Custom models perform fraud detection, risk analysis, and other types of analysis
on the data:

Azure Machine Learning services train and deploy the custom models.
The extracted data is retrieved from Azure Cosmos DB.
The models derive insights from the data.

These possibilities exist for inferencing:

Real-time processes. The models can be deployed to managed online
endpoints or to Kubernetes online endpoints, where the managed Kubernetes
cluster can run anywhere, including Azure Kubernetes Service (AKS).
Batch inferencing can be done at batch endpoints or in Azure Virtual
Machines.

3. The enriched data enters Azure Cosmos DB.
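
A minimal sketch of the Language API calls listed in step 1, using the azure-ai-textanalytics Python SDK; the endpoint, key, and sample text are placeholders.

```python
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

client = TextAnalyticsClient(
    endpoint="https://<your-language-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<key>"),
)

documents = ["Contoso Ltd. invoiced $1,200 for cloud services delivered in March."]

entities = client.recognize_entities(documents)[0]
key_phrases = client.extract_key_phrases(documents)[0]
pii = client.recognize_pii_entities(documents)[0]

print([(e.text, e.category) for e in entities.entities])  # named entities
print(key_phrases.key_phrases)                            # key phrases
print([(e.text, e.category) for e in pii.entities])       # personal information
```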

Analytics and visualizations

1. Applications use the raw OCR, structured data from Form Recognizer endpoints,
and the enriched data from NLP:

Power BI displays the data and presents reports on it.


The data functions as a source for Azure Cognitive Search.
Other applications consume the data.

Components
App Service is a platform as a service (PaaS) offering on Azure. You can use App
Service to host web applications that you can scale in or scale out manually or
automatically. The service supports various languages and frameworks, such as
ASP.NET, ASP.NET Core, Java, Ruby, Node.js, PHP, and Python.

Application Gateway is a layer-7 (application layer) load balancer that manages
traffic to web applications. You can run Application Gateway with Azure Web
Application Firewall to help protect web applications from common exploits and
vulnerabilities.

Azure Functions is a serverless compute platform that you can use to build
applications. With Functions, you can use triggers and bindings to react to changes
in Azure services like Blob Storage and Azure Cosmos DB. Functions can run
scheduled tasks, process data in real time, and process messaging queues.

Form Recognizer is part of Azure Applied AI Services. Form Recognizer offers a
collection of pre-built endpoints for extracting data from invoices, documents,
receipts, ID cards, and business cards. This service maps each piece of extracted
data to a field as a key-value pair. Form Recognizer also extracts table content and
structure. The output format is JSON.

Azure Storage is a cloud storage solution that includes object, blob, file, disk,
queue, and table storage.

Blob Storage is a service that's part of Azure Storage. Blob Storage offers
optimized cloud object storage for large amounts of unstructured data.

Azure Data Lake Storage is a scalable, secure data lake for high-performance
analytics workloads. The data typically comes from multiple heterogeneous
sources and can be structured, semi-structured, or unstructured. Azure Data Lake
Storage Gen2 combines Azure Data Lake Storage Gen1 capabilities with Blob
Storage. As a next-generation solution, Data Lake Storage Gen2 provides file
system semantics, file-level security, and scale. But it also offers the tiered storage,
high availability, and disaster recovery capabilities of Blob Storage.

Azure Cosmos DB is a fully managed, highly responsive, scalable NoSQL
database. Azure Cosmos DB offers enterprise-grade security and supports APIs for
many databases, languages, and platforms. Examples include SQL, MongoDB,
Gremlin, Table, and Apache Cassandra. Serverless, automatic scaling options in
Azure Cosmos DB efficiently manage capacity demands of applications.

Azure Cognitive Service for Language offers many NLP services that you can use
to understand and analyze text. Some of these services are customizable, such as
custom NER, custom text classification, conversational language understanding,
and question answering.
Machine Learning is an open platform for managing the development and
deployment of machine-learning models at scale. Machine Learning caters to skill
levels of different users, such as data scientists or business analysts. The platform
supports commonly used open frameworks and offers automated featurization
and algorithm selection. You can deploy models to various targets. Examples
include AKS, Azure Container Instances as a web service for real-time inferencing
at scale, and Azure Virtual Machine for batch scoring. Managed endpoints in
Machine Learning abstract the required infrastructure for real-time or batch model
inferencing.

AKS is a fully managed Kubernetes service that makes it easy to deploy and
manage containerized applications. AKS offers serverless Kubernetes technology,
an integrated continuous integration and continuous delivery (CI/CD) experience,
and enterprise-grade security and governance.

Power BI is a collection of software services and apps that display analytics
information.

Azure Cognitive Search is a cloud search service that supplies infrastructure,
APIs, and tools for searching. You can use Azure Cognitive Search to build search
experiences over private, heterogeneous content in web, mobile, and enterprise
applications.

Alternatives
You can use Azure Virtual Machines instead of App Service to host your
application.

You can use any relational database for persistent storage of the extracted data,
including:
Azure SQL Database .
Azure Database for PostgreSQL .
Azure Database for MySQL .

Scenario details
Automating document processing and data extraction is an integral task in
organizations across all industry verticals. AI is a proven solution in this process,
even though achieving 100 percent accuracy remains a distant reality. Still, using AI
for digitization instead of purely manual processes can reduce manual effort by up
to 90 percent.
Optical character recognition (OCR) can extract content from images and PDF files,
which make up most of the documents that organizations use. This process uses
keyword search and regular expression matching. These mechanisms extract relevant data
from full text and then create structured output. This approach has drawbacks. Revising
the post-extraction process to meet changing document formats requires extensive
maintenance effort.

Potential use cases


This solution is ideal for the finance industry. It can also apply to the automotive, travel,
and hospitality industries. The following tasks can benefit from this solution:

Approving expense reports


Processing invoices, receipts, and bills for insurance claims and financial audits
Processing claims that include invoices, discharge summaries, and other
documents
Automating statement of work (SoW) approvals
Automating ID extraction for verification purposes, as with passports or driver
licenses
Automating the process of entering business card data into visitor management
systems
Identifying purchase patterns and duplicate financial documents for fraud
detection

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Keep these points in mind when you use this solution.

Availability
The availability of the architecture depends on the Azure services that make up the
solution:

Form Recognizer is part of Applied AI Services. For this service's availability
guarantee, see SLA for Azure Applied AI Services .
Azure Cognitive Service for Language is part of Azure Cognitive Services. For the
availability guarantee for these services, see SLA for Azure Cognitive Services .

Azure Cosmos DB provides high availability by maintaining four replicas of data
within each region and by replicating data across regions. The exact availability
guarantee depends on whether you replicate within a single region or across
multiple regions. For more information, see Achieve high availability with Azure
Cosmos DB.

Blob Storage offers redundancy options that help ensure high availability. You can
use either of these approaches to replicate data three times in a primary region:
At a single physical location for locally redundant storage (LRS).
Across three availability zones that use differing availability parameters. For
more information, see Durability and availability parameters. This option works
best for applications that require high availability.

For the availability guarantees of other Azure services in the solution, see these
resources:
SLA for App Service
SLA for Azure Functions
SLA for Application Gateway
SLA for Azure Kubernetes Service (AKS)

Scalability
App Service can automatically scale out and in as the application load varies. For
more information, see Create an autoscale setting for Azure resources based on
performance data or a schedule.

Azure Functions can scale automatically or manually. The hosting plan that you
choose determines the scaling behavior of your function apps. For more
information, see Azure Functions hosting options.

By default, Form Recognizer supports 15 concurrent requests per second. You can
increase this value by creating an Azure support ticket with a quota increase
request.

For custom models that you host as web services on AKS, azureml-fe automatically
scales as needed. This front-end component routes incoming inference requests to
deployed services.

For batch inferencing, Machine Learning creates a compute cluster on demand
that scales automatically. For more information, see Tutorial: Build an Azure
Machine Learning pipeline for batch scoring. Machine Learning uses the
ParallelRunStep class to run the inferencing jobs in parallel.

For Azure Cognitive Service for Language, data and rate limits apply. For more
information, see these resources:
How to use named entity recognition (NER)
How to detect and redact personal information
How to use sentiment analysis and opinion mining
How to use Text Analytics for health

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

Azure Web Application Firewall helps protect your application from common
vulnerabilities. This Application Gateway option uses Open Web Application
Security Project (OWASP) rules to prevent attacks like cross-site scripting, session
hijacks, and other exploits.

To improve App Service security, consider these options:


App Service can access resources in Azure Virtual Network through virtual
network integration.
You can use App Service in an app service environment (ASE), which you deploy
to a dedicated virtual network. This approach helps to isolate the connectivity
between App Service and other resources in the virtual network.

For more information, see Security in Azure App Service.

Blob Storage and Azure Cosmos DB encrypt data at rest. You can secure these
services by using service endpoints or private endpoints.

Azure Functions supports virtual network integration. By using this functionality,
function apps can access resources inside a virtual network. For more information,
see Azure Functions networking options.

You can configure Form Recognizer and Azure Cognitive Service for Language for
access from specific virtual networks or from private endpoints. These services
encrypt data at rest. You can use subscription keys, tokens, or Microsoft Entra ID to
authenticate requests to these services. For more information, see Authenticate
requests to Azure Cognitive Services.

Machine Learning offers many levels of security:


Workspace authentication provides identity and access management.
You can use authorization to manage access to the workspace.
By securing workspace resources, you can improve network security.
You can use Transport Layer Security (TLS) to secure web services that you
deploy through Machine Learning.
To protect data, you can change the access keys for Azure Storage accounts that
Machine Learning uses.

Resiliency
The solution's resiliency depends on the failure modes of individual services like
App Service, Functions, Azure Cosmos DB, Storage, and Application Gateway. For
more information, see Resiliency checklist for specific Azure services.

You can make Form Recognizer resilient. Possibilities include designing it to fail
over to another region and splitting the workload into two or more regions. For
more information, see Back up and recover your Form Recognizer models.

Machine Learning services depend on many Azure services. To provide resiliency,
you need to configure each service to be resilient. For more information, see
Failover for business continuity and disaster recovery.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

The cost of implementing this solution depends on which components you use and
which options you choose for each component.

Many factors can affect the price of each component:

The number of documents that you process


The number of concurrent requests that your application receives
The size of the data that you store after processing
Your deployment region

These resources provide information on component pricing options:

Azure Form Recognizer pricing


App Service pricing
Azure Functions pricing
Application Gateway pricing
Azure Blob Storage pricing
Azure Cosmos DB pricing
Language Service pricing
Azure Machine Learning pricing

After deciding on a pricing tier for each component, use the Azure Pricing calculator
to estimate the solution cost.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Jyotsna Ravi | Senior Customer Engineer

Next steps
What is Azure Form Recognizer?
Get started: Document Intelligence Studio
Use Form Recognizer SDKs or REST API
What is Azure Cognitive Service for Language?
What is Azure Machine Learning?
Introduction to Azure Functions
How to configure Azure Functions with a virtual network
What is Azure Application Gateway?
What is Azure Web Application Firewall on Azure Application Gateway?
Tutorial: How to access on-premises SQL Server from Data Factory Managed VNet
using Private Endpoint
Azure Storage documentation

Related resources
Extract text from objects using Power Automate and AI Builder
Knowledge mining in business process management
Knowledge mining in contract management
Knowledge mining for content research
Automate PDF forms processing
Azure Document Intelligence Azure AI services Azure Logic Apps Azure Functions

This article describes an Azure architecture that you can use to replace costly and
inflexible forms processing methods with cost-effective and flexible automated PDF
processing.

Architecture

Download a PowerPoint file of this architecture.

Workflow
1. A designated Outlook email account receives PDF files as attachments. The arrival
of an email triggers a logic app to process the email. The logic app is built by using
the capabilities of Azure Logic Apps.
2. The logic app uploads the PDF files to a container in Azure Data Lake Storage.
3. You can also manually or programmatically upload PDF files to the same PDF
container.
4. The arrival of a PDF file in the PDF container triggers another logic app to process
the PDF forms that are in the PDF file.
5. The logic app sends the location of the PDF file to a function app for processing.
The function app is built by using the capabilities of Azure Functions.
6. The function app receives the location of the file and takes these actions:
a. It splits the file into single pages if the file has multiple pages. Each page
contains one independent form. Split files are saved to a second container in
Data Lake Storage.
b. It uses an HTTPS POST request to the Form Recognizer REST API to send the location of the single-page
PDF file to Azure Form Recognizer for processing. When Form Recognizer
completes its processing, it sends a response back to the function app, which
places the information into a data structure. (A sketch of steps 6a and 6b appears after this list.)
c. It creates a JSON data file that contains the response data and stores the file to
a third container in Data Lake Storage.
7. The forms processing logic app receives the processed response data.
8. The forms processing logic app sends the processed data to Azure Cosmos DB,
which saves the data in a database and in collections.
9. Power BI obtains the data from Azure Cosmos DB and provides insights and
dashboards.
10. You can implement further processing as needed on the data that's in Azure
Cosmos DB.
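
The following Python sketch illustrates steps 6a and 6b. It's a minimal sketch, not the accelerator's actual code: it assumes the pypdf and requests packages, hypothetical app setting names for the Form Recognizer endpoint and key, and the Form Recognizer v3 REST analyze operation.

# Minimal sketch of steps 6a and 6b (assumptions: Python function app, the pypdf
# and requests packages, and hypothetical setting names for the endpoint and key).
import os
import time

import requests
from pypdf import PdfReader, PdfWriter

ENDPOINT = os.environ["FORM_RECOGNIZER_ENDPOINT"]   # hypothetical setting name
KEY = os.environ["FORM_RECOGNIZER_KEY"]             # hypothetical setting name
MODEL_ID = os.environ["CUSTOM_BUILT_MODEL_ID"]      # custom model ID (see Deploy this scenario)

def split_pdf(path: str, out_dir: str) -> list[str]:
    """Step 6a: split a multi-page PDF into single-page files."""
    pages = []
    for i, page in enumerate(PdfReader(path).pages):
        writer = PdfWriter()
        writer.add_page(page)
        out_path = os.path.join(out_dir, f"page_{i + 1}.pdf")
        with open(out_path, "wb") as f:
            writer.write(f)
        pages.append(out_path)
    return pages

def analyze_page(blob_url: str) -> dict:
    """Step 6b: send the single-page PDF location to Form Recognizer and poll for the result."""
    url = f"{ENDPOINT}/formrecognizer/documentModels/{MODEL_ID}:analyze?api-version=2022-08-31"
    headers = {"Ocp-Apim-Subscription-Key": KEY, "Content-Type": "application/json"}
    response = requests.post(url, headers=headers, json={"urlSource": blob_url})
    response.raise_for_status()
    operation_url = response.headers["Operation-Location"]   # async operation to poll
    while True:
        result = requests.get(operation_url, headers=headers).json()
        if result["status"] in ("succeeded", "failed"):
            return result   # step 6c stores this JSON in a third container
        time.sleep(2)

In the deployed accelerator, this response JSON is what the function app saves to the third Data Lake Storage container in step 6c.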

Components
Azure Applied AI Services is a category of Azure AI products that use Azure
Cognitive Services, task-specific AI, and business logic to provide turnkey AI
services for common business processes. One of these products is Form
Recognizer , which uses machine learning models to extract key-value pairs, text,
and tables from documents.
Azure Logic Apps is a serverless cloud service for creating and running
automated workflows that integrate apps, data, services, and systems.
Azure Functions is a serverless solution that makes it possible for you to write
less code, maintain less infrastructure, and save on costs.
Azure Data Lake Storage is the foundation for building enterprise data lakes on
Azure.
Azure Cosmos DB is a fully managed NoSQL and relational database for modern
app development.
Power BI is a collection of software services, apps, and connectors that work
together so that you can turn your unrelated sources of data into coherent, visually
immersive, and interactive insights.

Alternatives
You can use Azure SQL Database instead of Azure Cosmos DB to store the
processed forms data.
You can use Azure Data Explorer to visualize the processed forms data that's
stored in Data Lake Storage.

Scenario details
Forms processing is often a critical business function. Many companies still rely on
manual processes that are costly, time consuming, and prone to error. Replacing manual
processes reduces cost and risk and makes a company more agile.

This article describes an architecture that you can use to replace manual PDF forms
processing or costly legacy systems that automate PDF forms processing. Form
Recognizer processes the PDF forms, Logic Apps provides the workflow, and Functions
provides data processing capabilities.

For deployment information, see Deploy this scenario in this article.

Potential use cases


The solution that's described in this article can process many types of forms, including:

Invoices
Payment records
Safety records
Incident records
Compliance records
Purchase orders
Payment authorization forms
Health screening forms
Survey forms

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework, a
set of guiding tenets that you can use to improve the quality of a workload. For more
information, see Microsoft Azure Well-Architected Framework.

Reliability
Reliability ensures that your application can meet the commitments that you make to
your customers. For more information, see Overview of the reliability pillar.
A reliable workload is one that's both resilient and available. Resiliency is the ability of
the system to recover from failures and continue to function. The goal of resiliency is to
return the application to a fully functioning state after a failure occurs. Availability is a
measure of whether your users can access your workload when they need to.

This architecture is intended as a starter architecture that you can quickly deploy and
prototype to provide a business solution. If your prototype is a success, you can then
extend and enhance the architecture, if necessary, to meet additional requirements.

This architecture utilizes scalable and resilient Azure infrastructure and technologies. For
example, Azure Cosmos DB has built-in redundancy and global coverage that you can
configure to meet your needs.

For the availability guarantees of the Azure services that this solution uses, see Service
Level Agreements (SLA) for Online Services .

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

The Outlook email account that's used in this architecture is a dedicated email account
that receives PDF forms as attachments. It's good practice to limit the senders to trusted
parties only and to prevent malicious actors from spamming the email account.

The implementation of this architecture that's described in Deploy this scenario takes
the following measures to increase security:

The PowerShell and Bicep deployment scripts use Azure Key Vault to store sensitive
information so that it isn't displayed on terminal screens or stored in deployment
logs.
Managed identities provide an automatically managed identity in Microsoft Entra
ID for applications to use when they connect to resources that support Microsoft
Entra authentication. The function app uses managed identities so that the code
doesn't depend on individual principals and doesn't contain sensitive identity
information.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and to
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
Here are some guidelines for optimizing costs:

Use the pay-as-you-go strategy for your architecture, and scale out as needed
rather than investing in large-scale resources at the start.
The implementation of the architecture that's described in Deploy this scenario
deploys a starting solution that's suitable for a proof of concept. The deployment
scripts create a working architecture with minimal resource requirements. For
example, the deployment scripts create the smallest serverless Linux host that can run the
function app.

Performance efficiency
Performance efficiency is the ability of your workload to scale in an efficient manner to
meet the demands that are placed on it by users. For more information, see
Performance efficiency pillar overview.

This architecture uses services that have built-in scaling capabilities that you can use to
improve performance efficiency. Here are some examples:

You can host both Azure Logic Apps and Azure Functions in a serverless
infrastructure. For more information, see Azure serverless overview: Create cloud-
based apps and solutions with Azure Logic Apps and Azure Functions.
You can configure Azure Cosmos DB to automatically scale its throughput. For
more information, see Provision autoscale throughput on a database or container
in Azure Cosmos DB - API for NoSQL.

Deploy this scenario


You can deploy a rudimentary version of this architecture, called a solution accelerator, and use
it as a starting point for deploying your own solution. The reference implementation for
the accelerator includes code, deployment scripts, and a deployment guide.

The accelerator receives the PDF forms, extracts the data fields, and saves the data in
Azure Cosmos DB. Power BI visualizes the data. The design uses a modular, metadata-
driven methodology. No form fields are hard-coded, so the accelerator can process any PDF form.

You can use the accelerator as is, without code modification, to process and visualize
any single-page PDF forms such as safety forms, invoices, incident records, and many
others. To use it, you only need to collect sample PDF forms, train a new model to learn
the layout of the forms, and plug the model into the solution. You also need to redesign
the Power BI report for your datasets so that it provides the insights that you want.

The implementation uses Form Recognizer Studio to create custom models. The
accelerator uses the field names that are saved in the machine learning model as a
reference to process other forms. Only five sample forms are needed to create a
custom-built machine learning model. You can merge as many as 100 custom-built
models to create a composite machine learning model that can process a variety of
forms.

Deployment repository
The GitHub repository for the solution accelerator is:

https://github.com/microsoft/Azure-PDF-Form-Processing-Automation-Solution-
Accelerator

The readme file that's displayed at that location provides an overview of the accelerator.

The deployment files are in the top-level Deployment folder of the repository:

https://github.com/microsoft/Azure-PDF-Form-Processing-Automation-Solution-
Accelerator/tree/main/Deployment

The readme file that's displayed at that location is the deployment guide. To deploy the solution, follow the steps in that guide.

Step 2 provides details about using sample PDF forms to create a custom-built machine
learning model. You plug the model into the solution by setting the function app environment
variable CUSTOM_BUILT_MODEL_ID to the ID of the custom model. For more information, see step 3.

Deployment prerequisites
To deploy, you need an Azure subscription. For information about free subscriptions, see
Build in the cloud with an Azure free account .

To learn about the services that are used in the accelerator, see the overview and
reference articles that are listed in:

Azure Form Recognizer documentation


Azure Logic Apps documentation
Azure Functions documentation
Introduction to Azure Data Lake Storage Gen2
Azure Cosmos DB documentation
Power BI documentation

Deployment considerations
To process a new type of PDF form, you use sample PDF files to create a new machine
learning model. When the model is ready, you plug the model ID into the solution.

The names of the storage containers are configurable in the deployment scripts that you get from the
GitHub repository.

The architecture doesn't address any high availability (HA) or disaster recovery (DR)
requirements. If you want to extend and enhance the current architecture for production
deployment, consider the following recommendations and best practices:

Design the HA/DR architecture based on your requirements and use the built-in
redundancy capabilities where applicable.
Update the Bicep deployment code to create a computing environment that can
handle your processing volumes.
Update the Bicep deployment code to create more instances of the architecture
components to satisfy your HA/DR requirements.
Follow the guidelines in Azure Storage redundancy when you design and provision
storage.
Follow the guidelines in Business continuity and disaster recovery when you design
and provision the logic apps.
Follow the guidelines in Reliability in Azure Functions when you design and
provision the function app.
Follow the guidelines in Achieve high availability with Azure Cosmos DB when you
design and provision a database that was created by using Azure Cosmos DB.
If you put this system into production to process large volumes of
PDF forms, modify the deployment scripts to create a Linux host that has
more resources. To do so, modify the code in
https://github.com/microsoft/Azure-PDF-Form-Processing-Automation-Solution-
Accelerator/blob/main/Deployment/1_deployment_scripts/deploy-
functionsapp.bicep

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:
Gail Zhou | Sr. Architect

Other contributors:

Nalini Chandhi | Principal Technical Specialist


Steve DeMarco | Sr. Cloud Solution Architect
Travis Hilbert | Technical Specialist Global Black Belt
DB Lee | Sr. Technical Specialist
Malory Rose | Technical Specialist Global Black Belt
Oscar Shimabukuro | Sr. Cloud Solution Architect
Echo Wang | Principal Program Manager

To see nonpublic LinkedIn profiles, sign in to LinkedIn.

Next steps
Video: Azure PDF Form Processing Automation SA .
Azure PDF Form Processing Automation Solution Accelerator
Azure invoice Process Automation Solution Accelerator
Business Process Automation Accelerator
Tutorial: Create workflows that process emails using Azure Logic Apps, Azure
Functions, and Azure Storage

Related resources
Custom document processing models on Azure
Index file content and metadata by using Azure Cognitive Search
Automate document identification, classification, and search by using Durable
Functions
Automate document processing by using Azure Form Recognizer
Custom document processing models on Azure
Azure Document Intelligence Azure AI services Azure Logic Apps Azure Machine Learning Studio

Azure Storage

This article describes Azure solutions for building, training, deploying, and using custom
document processing models. These Azure services also offer user interface (UI)
capabilities to do labeling or tagging for text processing.

Architecture
The architecture diagram shows four stages: data ingestion and orchestration (1), data stores (2), labeling, tagging, and training (3), and deployment (4). Sources such as email servers, FTP servers, and web apps flow through Logic Apps, Data Factory, Web Apps, and Function Apps into Blob Storage and Data Lake Storage. Form Recognizer Studio, Language Studio, and Machine Learning studio handle labeling, tagging, and training. Models are deployed through the built-in Form Recognizer deployment, Cognitive Service for Language custom model endpoints, or Machine Learning Kubernetes Services and batch/online managed endpoints.

Download a Visio file of this architecture.

Dataflow
1. Orchestrators like Azure Logic Apps, Azure Data Factory, or Azure Functions ingest
messages and attachments from email servers, and files from FTP servers or web
applications.

Azure Functions and Logic Apps enable serverless workloads. The service you
choose depends on your preference for service capabilities like development,
connectors, management, and execution context. For more information, see
Compare Azure Functions and Azure Logic Apps.

Consider using Azure Data Factory for bulk data movement.


2. The orchestrators send ingested data to Azure Blob Storage or Data Lake Storage,
organizing the data across data stores based on characteristics like file extensions
or customers.

3. Form Recognizer Studio, Language Studio, or Azure Machine Learning studio label
and tag textual data and build the custom models. You can use these three services
independently or in various combinations to address different use cases.

If the document requires extracting key-value pairs or creating a custom table from an image format or PDF, use Form Recognizer Studio to tag the data and train the custom model.

For document classification based on content, or for domain-specific entity extraction, you can train a custom text classification or Named Entity Recognition (NER) model in Language Studio.

Azure Machine Learning studio can also do labeling for text classification or entity extraction with open-source frameworks like PyTorch or TensorFlow.

4. To deploy the custom models and use them for inference:

Form Recognizer has built-in model deployment. Use Form Recognizer SDKs or the REST API to apply custom models for inferencing. Include the model ID or custom model name in the Form Recognizer request URL, depending on the API version. Form Recognizer doesn't require any further deployment steps. (A sketch of an SDK call appears after this list.)

Language Studio provides an option to deploy custom language models. Get the REST endpoint prediction URL by selecting the model to deploy. You can do model inferencing by using either the REST endpoint or the Azure SDK client libraries.

Azure Machine Learning can deploy custom models to online or batch Azure Machine Learning managed endpoints. You can also deploy to Azure Kubernetes Service (AKS) as a web service by using the Azure Machine Learning SDK.
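
For example, the following sketch applies a custom model for inferencing by using the Form Recognizer Python SDK (azure-ai-formrecognizer). The endpoint, key, model ID, and document URL are placeholders.

# Sketch of custom model inferencing with the Form Recognizer Python SDK.
# All names in angle brackets are placeholders.
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<form-recognizer-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<key>"),
)

# Pass the custom model ID instead of a prebuilt model name.
poller = client.begin_analyze_document_from_url(
    model_id="<custom-model-id>",
    document_url="https://<storage-account>.blob.core.windows.net/forms/sample.pdf",
)
result = poller.result()

for name, field in result.documents[0].fields.items():
    print(name, field.value, field.confidence)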

Components
Logic Apps is part of Azure Integration Services . Logic Apps creates
automated workflows that integrate apps, data, services, and systems. With
managed connectors for services like Azure Storage and Office 365, you can
trigger workflows when a file lands in the storage account or email is received.
Data Factory is a managed cloud extract, transform, load (ETL) service for data
integration and transformation. Data Factory can add transformation activities to a
pipeline that include invoking a REST endpoint or running a notebook on the
ingested data.

Azure Functions is a serverless compute service that can host event-driven workloads with short-lived processes.

Blob Storage is the object storage solution for raw files in this scenario. Blob
Storage supports libraries for multiple languages, such as .NET, Node.js, and
Python. Applications can access files on Blob Storage via HTTP/HTTPS. Blob
Storage has hot, cool, and archive access tiers to support cost optimization for
storing large amounts of data.

Data Lake Storage is a set of capabilities built on Azure Blob Storage for big data
analytics. Data Lake Storage retains the cost effectiveness of Blob Storage, and
provides features like file-level security and file system semantics with hierarchical
namespace.

Form Recognizer, part of Azure Applied AI Services, has in-built document analysis capabilities to extract printed and handwritten text, tables, and key-value pairs. Form Recognizer has prebuilt models for extracting data from invoices, documents, receipts, ID cards, and business cards. Form Recognizer can also train and deploy custom models by using either a custom template form model or a custom neural document model.

Form Recognizer Studio provides a UI for exploring Form Recognizer features and models, and for building, tagging, training, and deploying custom models.

Azure Cognitive Service for Language consolidates the Azure natural language
processing services. The suite offers prebuilt and customizable options. For more
information, see the Cognitive Service for Language available features.

Language Studio provides a UI for exploring and analyzing Azure Cognitive Service for Language features. Language Studio also provides options for building, tagging, training, and deploying custom models.

Azure Machine Learning is an open platform for managing machine learning model development and deployment at scale.
Azure Machine Learning studio provides data labeling options for images and text.
Export labeled data as COCO or Azure Machine Learning datasets. You can use the datasets for training and deploying models in Azure Machine Learning notebooks.
Deploy models to AKS as a web service for real-time inferencing at scale, or as managed endpoints for both real-time and batch inferencing. (A sketch of a managed endpoint deployment appears after this list.)
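
As an illustration of the managed endpoint option, the following sketch uses the Azure Machine Learning Python SDK (azure-ai-ml) to create an online managed endpoint and deployment. The workspace identifiers, environment reference, scoring script, and instance size are placeholders, not prescriptive choices.

# Sketch of deploying a custom model to an online managed endpoint with azure-ai-ml.
# Workspace identifiers, environment name, and paths are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    CodeConfiguration,
    ManagedOnlineDeployment,
    ManagedOnlineEndpoint,
    Model,
)
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>"
)

endpoint = ManagedOnlineEndpoint(name="custom-ner-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="custom-ner-endpoint",
    model=Model(path="./model"),                      # local model artifacts to register
    environment="<curated-or-custom-environment>",    # placeholder environment reference
    code_configuration=CodeConfiguration(code="./src", scoring_script="score.py"),
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()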

Alternatives
You can add more workflows to this scenario based on specific use cases.

If the document is in image or PDF format, you can extract the data by using Azure
Computer Vision, Form Recognizer Read API, or open-source libraries.

You can do document and conversation summarization by using the prebuilt model in Azure Cognitive Service for Language.

Use pre-processing code to do text processing steps like cleaning, stop words
removal, lemmatization, stemming, and text summarization on extracted data, per
document processing requirements. You can expose the code as REST APIs for
automation. Do these steps manually or automate them by integrating with the
Logic Apps or Azure Functions ingestion process.

Scenario details
Document processing is a broad area. It can be difficult to meet all your document
processing needs with the prebuilt models available in Azure Form Recognizer and
Azure Cognitive Service for Language. You might need to build custom models to
automate document processing for different applications and domains.

Major challenges in model customization include:

Labeling or tagging text data with relevant key-value pair entities to classify text
for extraction.
Deploying models securely at scale for easy integration with consuming
applications.

Potential use cases


The following use cases can take advantage of custom models for document processing:

Build custom NER and text classification models based on open-source frameworks.
Extract custom key-values from documents for various industry verticals like
insurance and healthcare.
Tag and extract specific domain-dependent entities beyond the prebuilt NER
models, for domains like security or finance.
Create custom tables from documents.
Extract signatures.
Label and classify emails or other documents based on content.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

For this example workload, implementing each pillar depends on optimally configuring
and using each component Azure service.

Reliability
Reliability ensures your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.

Availability
See the availability service level agreements (SLAs) for each component Azure
service:
Azure Form Recognizer - SLA for Azure Applied AI Services .
Azure Cognitive Service for Language - SLA for Azure Cognitive Services .
Azure Functions - SLA for Azure Functions .
Azure Kubernetes Service - SLA for Azure Kubernetes Service (AKS) .
Azure Storage - SLA for Storage Accounts .

For configuration options to design high availability applications with Azure storage accounts, see Use geo-redundancy to design highly available applications.

Resiliency

Handle failure modes of individual services like Azure Functions and Azure Storage
to ensure resiliency of the compute services and data stores in this scenario. For
more information, see Resiliency checklist for specific Azure services.

For Form Recognizer, back up and recover your Form Recognizer models.
For custom text classification with Cognitive Services for Language, back up and
recover your custom text classification models.

For custom NER in Cognitive Services for Language, back up and recover your
custom NER models.

Azure Machine Learning depends on constituent services like Blob Storage, compute services, and AKS. To provide resiliency for Azure Machine Learning, configure each of these services to be resilient. For more information, see Failover for business continuity and disaster recovery.

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

Implement data protection, identity and access management, and network security
recommendations for Blob Storage, Cognitive Services for Form Recognizer and
Language Studio, and Azure Machine Learning.

Azure Functions can access resources in a virtual network through virtual network
integration.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

The total cost of implementing this solution depends on the pricing of the services you
choose.

The major costs for this solution are:

The compute cost involved in Azure Machine Learning training. Choose the right
node type, cluster size, and number of nodes to help optimize costs. Azure
Machine Learning provides options to set the minimum nodes to zero and to set
the idle time before the scale down. For more information, see Manage and
optimize Azure Machine Learning costs.

Data orchestration duration and activities. For Azure Data Factory, the charges for
copy activities on the Azure integration runtime are based on the number of Data
Integration Units (DIUs) used and the execution duration. Added orchestration
activity runs are also charged, based on their number.

Logic Apps pricing plans depend on the resources you create and use. The
following articles can help you choose the right plan for specific use cases:
Costs that typically accrue with Azure Logic Apps
Single-tenant versus multi-tenant and integration service environment for Azure
Logic Apps
Usage metering, billing, and pricing models for Azure Logic Apps

For more information on pricing for specific components, see the following resources:

Azure Form Recognizer pricing


Azure Functions pricing
Logic Apps Pricing
Azure Data Factory pricing
Azure Blob Storage pricing
Language Service pricing
Azure Machine Learning pricing

Use the Azure pricing calculator to add your selected component options and
estimate the overall solution cost.

Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands
placed on it by users in an efficient manner. For more information, see Performance
efficiency pillar overview.

Scalability

To scale Azure Functions automatically or manually, choose the right hosting plan.

Form Recognizer supports 15 concurrent requests per second by default. To request an increased quota, create an Azure support ticket.

For Azure Machine Learning custom models hosted as web services on AKS, the
azureml-fe front end automatically scales as needed. This component also routes
incoming inference requests to deployed services.

For deployments as managed endpoints, support autoscaling by integrating with the Azure Monitor autoscale feature.
The API service limits on custom NER and custom text classification for inferencing
are 20 GET or POST requests per minute.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributor.

Principal author:

Jyotsna Ravi | Sr. Customer Engineer

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Get started: Form Recognizer Studio
Use Form Recognizer SDKs or REST API
Quickstart: Get started with Language Studio
What is optical character recognition (OCR)?
How to configure Azure Functions with a virtual network

Related resources
Extract text from objects using Power Automate and AI Builder
Suggest content tags with NLP using deep learning
Knowledge mining for content research
Automate document processing by using Azure Form Recognizer
Index file content and metadata by using Azure Cognitive Search
Azure Cognitive Search Azure Blob Storage Azure Table Storage

This article demonstrates how to create a search service that enables users to search for
documents based on document content in addition to any metadata that's associated
with the files.

You can implement this service by using multiple indexers in Azure Cognitive Search.

This article uses an example workload to demonstrate how to create a single search
index that's based on files in Azure Blob Storage. The file metadata is stored in Azure
Table Storage.

Architecture
Download a PowerPoint file of this architecture.

Dataflow
1. Files are stored in Blob Storage, possibly together with a limited amount of
metadata (for example, the document's author).
2. Additional metadata is stored in Table Storage, which can store significantly more
information for each document.
3. An indexer reads the contents of each file, together with any blob metadata, and
stores the data in the search index.
4. Another indexer reads the additional metadata from the table and stores it in the
same search index.
5. A search query is sent to the search service. The query returns matching
documents, based on both document content and document metadata.

Components
Blob Storage provides cost-effective cloud storage for file data, including data in
formats like PDF, HTML, and CSV, and in Microsoft Office files.
Table Storage provides storage for nonrelational structured data. In this scenario,
it's used to store the metadata for each document.
Azure Cognitive Search is a fully managed search service that provides
infrastructure, APIs, and tools for building a rich search experience.

Alternatives
This scenario uses indexers in Azure Cognitive Search to automatically discover new
content in supported data sources, like blob and table storage, and then add it to the
search index. Alternatively, you can use the APIs provided by Azure Cognitive Search to
push data to the search index. If you do, however, you need to write code to push the
data into the search index and also to parse and extract text from the binary documents
that you want to search. The Blob Storage indexer supports many document formats,
which significantly simplifies the text extraction and indexing process.

Also, if you use indexers, you can optionally enrich the data as part of an indexing
pipeline. For example, you can use Azure Cognitive Services to perform optical character
recognition (OCR) or visual analysis of the images in documents, detect the language of
documents, or translate documents. You can also define your own custom skills to
enrich the data in ways that are relevant to your business scenario.

This architecture uses blob and table storage because they're cost-effective and
efficient. This design also enables combined storage of the documents and metadata in
a single storage account. Alternative supported data sources for the documents
themselves include Azure Data Lake Storage and Azure Files. Document metadata can
be stored in any other supported data source that holds structured data, like Azure SQL
Database and Azure Cosmos DB.

Scenario details

Searching file content


This solution enables users to search for documents based on both file content and
additional metadata that's stored separately for each document. In addition to searching
the text content of a document, a user might want to search for the document's author,
the document type (like paper or report), or its business impact (high, medium, or low).

Azure Cognitive Search is a fully managed search service that can create search indexes
that contain the information you want to allow users to search for.

Because the files that are searched in this scenario are binary documents, you can store
them in Blob Storage. If you do, you can use the built-in Blob Storage indexer in Azure
Cognitive Search to automatically extract text from the files and add their content to the
search index.

Searching file metadata


If you want to include additional information about the files, you can directly associate
metadata with the blobs, without using a separate store. The built-in Blob Storage
search indexer can even read this metadata and place it in the search index. This enables
users to search for metadata along with the file content. However, the amount of
metadata is limited to 8 KB per blob, so the amount of information that you can place
on each blob is fairly small. You might choose to store only the most critical information
directly on the blobs. In this scenario, only the document's author is stored on the blob.

To overcome this storage limitation, you can place additional metadata in another data
source that has a supported indexer, like Table Storage. You can add the document type,
business impact, and other metadata values as separate columns in the table. If you
configure the built-in Table Storage indexer to target the same search index as the blob
indexer, the blob and table storage metadata is combined for each document in the
search index.

Using multiple data sources for a single search index


To ensure that both indexers point to the same document in the search index, the
document key in the search index is set to a unique identifier of the file. This unique
identifier is then used to refer to the file in both data sources. The blob indexer uses the
metadata_storage_path as the document key, by default. The metadata_storage_path property stores the full URL of the file in Blob Storage, for example, https://contoso.blob.core.windows.net/files/paper/Resilience in Azure.pdf . The indexer performs Base64 encoding on the value to ensure that there are no invalid characters in the document key. The result is a unique document key, like aHR0cHM6...mUucGRm0 .
If you add the metadata_storage_path as a column in Table Storage, you know exactly which blob the metadata in the other columns belongs to, so you can use any PartitionKey and RowKey value in the table. For example, you could use the blob container name as the PartitionKey and the Base64-encoded full URL of the blob as the RowKey, ensuring that there are no invalid characters in these keys either.

You can then use a field mapping in the table indexer to map the metadata_storage_path column (or another column) in Table Storage to the metadata_storage_path document key field in the search index. If you apply the base64Encode function on the field mapping, you end up with the same document key (aHR0cHM6...mUucGRm0 in the earlier example), and the metadata from Table Storage is added to the same document that was extracted from Blob Storage.

Note

The table indexer documentation states that you shouldn't define a field mapping
to an alternative unique string field in your table. That's because the indexer
concatenates the PartitionKey and RowKey as the document key, by default.
Because you're already relying on the document key as configured by the blob
indexer (which is the Base64-encoded full URL of the blob), creating a field
mapping to ensure that both indexers refer to the same document in the search
index is appropriate and supported for this scenario.

Alternatively, you can map the RowKey (which is set to the Base64-encoded full URL of
the blob) to the metadata_storage_path document key directly, without storing it
separately and Base64-encoding it as part of the field mapping. However, keeping the
unencoded URL in a separate column clarifies which blob it refers to and allows you to
choose any partition and row keys without affecting the search indexer.
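
The following sketch shows one way to define the table indexer with the field mapping that's described in this section, by calling the Azure Cognitive Search REST API from Python. The service name, API key, data source, index, and API version are placeholders; the deployable sample configures these resources for you.

# Sketch of creating the table indexer with a base64Encode field mapping
# by using the Azure Cognitive Search REST API. All names are placeholders.
import requests

SEARCH_SERVICE = "https://<search-service>.search.windows.net"
ADMIN_KEY = "<admin-api-key>"

indexer = {
    "name": "table-metadata-indexer",
    "dataSourceName": "<table-storage-data-source>",   # points at the metadata table
    "targetIndexName": "<search-index>",               # same index that the blob indexer targets
    "fieldMappings": [
        {
            # Map the column that stores the unencoded blob URL to the document key,
            # applying the same Base64 encoding that the blob indexer uses.
            "sourceFieldName": "metadata_storage_path",
            "targetFieldName": "metadata_storage_path",
            "mappingFunction": {"name": "base64Encode"},
        }
    ],
}

response = requests.put(
    f"{SEARCH_SERVICE}/indexers/{indexer['name']}?api-version=2023-11-01",
    headers={"api-key": ADMIN_KEY, "Content-Type": "application/json"},
    json=indexer,
)
response.raise_for_status()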

Potential use cases


This scenario applies to applications that require the ability to search for documents
based on their content and additional metadata.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that you can use to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Reliability
Reliability ensures that your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.

Azure Cognitive Search provides a high SLA for reads (querying) if you have at least
two replicas. It provides a high SLA for updates (updating the search indexes) if you have
at least three replicas. You should therefore provision at least two replicas if you want
your users to be able to search reliably, and three if actual changes to the index also
need to be high-availability operations.

Azure Storage always stores multiple copies of your data to help protect it against
planned and unplanned events. Azure Storage provides additional redundancy options
for replicating data across regions. These safeguards apply to data in blob and table
storage.

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

Azure Cognitive Search provides robust security controls that help you implement
network security, authentication and authorization, data residency and protection, and
administrative controls that help you maintain security, privacy, and compliance.

Whenever possible, use Microsoft Entra authentication to provide access to the search
service itself, and connect your search service to other Azure resources (like blob and
table storage in this scenario) by using a managed identity.

You can connect from the search service to the storage account by using a private
endpoint. When you use a private endpoint, the indexers can use a private connection
without requiring the blob and table storage to be accessible publicly.

Cost optimization
Cost optimization is about reducing unnecessary expenses and improving operational
efficiencies. For more information, see Overview of the cost optimization pillar.

For information about the costs of running this scenario, see this preconfigured estimate
in the Azure pricing calculator . All the services described here are configured in this
estimate. The estimate is for a workload that has a total document size of 20 GB in Blob
Storage and 1 GB of metadata in Table Storage. Two search units are used to satisfy the
SLA for read purposes, as described in the reliability section of this article. To see how
the pricing would change for your particular use case, change the appropriate variables
to match your expected usage.

If you review the estimate, you can see that the cost of blob and table storage is
relatively low. Most of the cost is incurred by Azure Cognitive Search, because it
performs the actual indexing and compute for running search queries.

Deploy this scenario


To deploy this example workload, see Indexing file contents and metadata in Azure
Cognitive Search . You can use this sample to:

Create the required Azure services.


Upload a few sample documents to Blob Storage.
Populate the author metadata value on the blob.
Store the document type and business impact metadata values in Table Storage.
Create the indexers that maintain the search index.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Jelle Druyts | Principal Customer Experience Engineer

Other contributor:

Mick Alberts | Technical Writer

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Get started with Azure Cognitive Search
Increase relevancy using semantic search in Azure Cognitive Search
Security filters for trimming results in Azure Cognitive Search
Tutorial: Index from multiple data sources using the .NET SDK

Related resources
Choose a search data store in Azure
Intelligent product search engine for e-commerce
Process free-form text for search
Analyze video content with Computer Vision and Azure Machine Learning
Azure Machine Learning Azure AI services Azure Logic Apps Azure Synapse Analytics

Azure Data Lake Storage

This article describes an architecture that you can use to replace the manual analysis of
video footage with an automated, and frequently more accurate, machine learning
process.

The FFmpeg and Jupyter Notebook logos are trademarks of their respective companies. No
endorsement is implied by the use of these marks.

Architecture

Download a PowerPoint file of this architecture.

Workflow
1. A collection of video footage, in MP4 format, is uploaded to Azure Blob Storage.
Ideally, the videos go into a "raw" container.
2. A preconfigured pipeline in Azure Machine Learning recognizes that video files are
uploaded to the container and initiates an inference cluster to start separating the
video footage into frames.
3. FFmpeg, an open-source tool, breaks down the video and extracts frames. You can
configure how many frames per second are extracted, the quality of the extraction,
and the format of the image file. The format can be JPG or PNG. (A sketch of this step appears after this list.)
4. The inference cluster sends the images to Azure Data Lake Storage.
5. A preconfigured logic app that monitors Data Lake Storage detects that new
images are being uploaded. It starts a workflow.
6. The logic app calls a pretrained custom vision model to identify objects, features,
or qualities in the images. Alternatively or additionally, it calls a computer vision
(optical character recognition) model to identify textual information in the images.
7. Results are received in JSON format. The logic app parses the results and creates
key-value pairs. You can store the results in Azure dedicated SQL pools that are
provisioned by Azure Synapse Analytics.
8. Power BI provides data visualization.
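
The following Python sketch illustrates step 3. It assumes that the FFmpeg binary is available on the compute that runs the pipeline; the paths, frame rate, and image format are illustrative only.

# Sketch of step 3: extract frames from an MP4 file with FFmpeg.
# Assumes the ffmpeg binary is installed; paths and settings are illustrative.
import subprocess
from pathlib import Path

def extract_frames(video_path: str, output_dir: str, fps: int = 1, image_format: str = "jpg") -> None:
    """Extract a configurable number of frames per second as JPG or PNG files."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg",
            "-i", video_path,          # input video from the "raw" container
            "-vf", f"fps={fps}",       # how many frames per second to extract
            "-q:v", "2",               # output quality (lower values mean higher quality for JPG)
            f"{output_dir}/frame_%05d.{image_format}",
        ],
        check=True,
    )

extract_frames("footage/site-visit.mp4", "frames/site-visit", fps=1, image_format="jpg")

The extracted frames are then uploaded to Data Lake Storage, as described in step 4.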

Components
Azure Blob Storage provides object storage for cloud-native workloads and
machine learning stores. In this architecture, it stores the uploaded video files.
Azure Machine Learning is an enterprise-grade machine learning service for the
end-to-end machine learning lifecycle.
Azure Data Lake Storage provides massively scalable, enhanced-security, cost-
effective cloud storage for high-performance analytics workloads.
Computer Vision is part of Azure Cognitive Services . It's used to retrieve
information about each image.
Custom Vision enables you to customize and embed state-of-the-art computer
vision image analysis for your specific domains.
Azure Logic Apps automates workflows by connecting apps and data across
environments. It provides a way to access and process data in real time.
Azure Synapse Analytics is a limitless analytics service that brings together data
integration, enterprise data warehousing, and big data analytics.
Dedicated SQL pool (formerly SQL DW) is a collection of analytics resources that
are provisioned when you use Azure Synapse SQL.
Power BI is a collection of software services, apps, and connectors that work
together to provide visualizations of your data.

Alternatives
Azure Video Indexer is a video analytics service that uses AI to extract actionable
insights from stored videos. You can use it without any expertise in machine
learning.
Azure Data Factory is a fully managed serverless data integration service that
helps you construct ETL and ELT processes.
Azure Functions is a serverless platform as a service (PaaS) that runs single-task
code without requiring new infrastructure.
Azure Cosmos DB is a fully managed NoSQL database for modern app
development.

Scenario details
Many industries record video footage to detect the presence or absence of a particular
object or entity or to classify objects or entities. Video monitoring and analyses are
traditionally performed manually. These processes are often monotonous and prone to
errors, particularly for tasks that are difficult for the human eye. You can automate these
processes by using AI and machine learning.

A video recording can be separated into individual frames so that various technologies
can analyze the images. One such technology is computer vision: the capability of a
computer to identify objects and entities on an image.

With computer vision, monitoring video footage becomes automated, standardized, and potentially more accurate. A computer vision model can be trained, and, depending on the use case, you can frequently get results that are at least as good as those of the person who trained the model. By using Machine Learning Operations (MLOps) to improve the model continuously, you can expect better results over time and can react to changes in the video data over time.

Potential use cases


This scenario is relevant for any business that analyzes videos. Here are some sample
use cases:

Agriculture. Monitor and analyze crops and soil conditions over time. By using
drones or UAVs, farmers can record video footage for analysis.

Environmental sciences. Analyze aquatic species to understand where they're located and how they evolve. By attaching underwater cameras to boats, environmental researchers can navigate the shoreline to record video footage. They can analyze the video footage to understand species migrations and how species populations change over time.
Traffic control. Classify vehicles into categories (SUV, car, truck, motorcycle), and
use the information to plan traffic control. Video footage can be provided by CCTV
in public locations. Most CCTV cameras record date and time, which can be easily
retrieved via optical character recognition (OCR).

Quality assurance. Monitor and analyze quality control in a manufacturing facility. By installing cameras on the production line, you can train a computer vision model to detect anomalies.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework, a
set of guiding tenets that you can use to improve the quality of a workload. For more
information, see Microsoft Azure Well-Architected Framework.

Reliability
Reliability ensures your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.

A reliable workload is one that's both resilient and available. Resiliency is the ability of
the system to recover from failures and continue to function. The goal of resiliency is to
return the application to a fully functioning state after a failure occurs. Availability is a
measure of whether your users can access your workload when they need to.

For the availability guarantees of the Azure services in this solution, see these resources:

SLA for Storage Accounts


SLA for Azure Machine Learning
SLA for Azure Cognitive Services
SLA for Logic Apps
SLA for Azure Synapse Analytics
SLA for Power BI

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

Consider the following resources:

Identity management
Protect your infrastructure
Application security
Data sovereignty and encryption
Security resources

Cost optimization
Cost optimization is about reducing unnecessary expenses and improving operational
efficiencies. For more information, see Overview of the cost optimization pillar.

Here are some guidelines for optimizing costs:

Use the pay-as-you-go strategy for your architecture, and scale out as needed
rather than investing in large-scale resources at the start.
Consider opportunity costs in your architecture, and the balance between first-
mover advantage versus fast follow. Use the pricing calculator to estimate the
initial cost and operational costs.
Establish policies, budgets, and controls that set cost limits for your solution.

Operational excellence
Operational excellence covers the operations processes that deploy an application and
keep it running in production. For more information, see Overview of the operational
excellence pillar.

Deployments need to be reliable and predictable. Here are some guidelines:

Automate deployments to reduce the chance of human error.


Implement a fast, routine deployment process to avoid slowing down the release
of new features and bug fixes.
Quickly roll back or roll forward if an update causes problems.

Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands
placed on it by users in an efficient manner. For more information, see Performance
efficiency pillar overview.

Appropriate use of scaling and the implementation of PaaS offerings that have built-in
scaling are the main ways to achieve performance efficiency.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Oscar Shimabukuro Kiyan | Senior Cloud Solutions Architect – Data & AI

Other contributors:

Mick Alberts | Technical Writer


Brandon Cowen | Senior Cloud Solutions Architect – Data & AI
Arash Mosharraf | Senior Cloud Solutions Architect – Data & AI
Priyanshi Singh | Senior Cloud Solutions Architect – Data & AI
Julian Soh | Director Specialist – Data & AI

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Introduction to Azure Storage
What is Azure Machine Learning?
What is Azure Cognitive Services?
What is Azure Logic Apps?
What is Azure Synapse Analytics?
What is Power BI embedded analytics?
Business Process Accelerator

Related resources
Image classification with convolutional neural networks (CNNs)
Image classification on Azure
MLOps framework to upscale machine learning lifecycle
Image classification on Azure
Azure Blob Storage Azure Computer Vision Azure Cosmos DB Azure Event Grid Azure Functions

By using Azure services, such as the Computer Vision API and Azure Functions,
companies can eliminate the need to manage individual servers, while reducing costs
and utilizing the expertise that Microsoft has already developed in processing images
with Cognitive Services. This example scenario specifically addresses an image-
processing use case. If you have different AI needs, consider the full suite of Cognitive
Services.

Architecture

Download a Visio file of this architecture.

Workflow
This scenario covers the back-end components of a web or mobile application. Data
flows through the scenario as follows:

1. Adding new files (image uploads) in Blob storage triggers an event in Azure Event
Grid. The uploading process can be orchestrated via the web or a mobile
application. Alternatively, images can be uploaded separately to the Azure Blob
storage.
2. Event Grid sends a notification that triggers the Azure Functions.
3. Azure Functions calls the Azure Computer Vision API to analyze the newly
uploaded image. Computer Vision accesses the image via the blob URL that's
parsed by Azure Functions.
4. Azure Functions persists the Computer Vision API response in Azure Cosmos DB.
This response includes the results of the analysis, along with the image metadata. (A sketch of steps 3 and 4 appears after this list.)
5. The results can be consumed and reflected on the web or mobile front end. Note
that this approach retrieves the results of the classification but not the uploaded
image.
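
The following sketch illustrates steps 3 and 4 as a Python function with an Event Grid trigger. It's a sketch only: the environment variable names, the database and container names, and the assumption that the Cosmos DB container is partitioned on /id are all placeholders.

# Sketch of steps 3 and 4: analyze a newly uploaded image with the Computer Vision
# API and persist the response in Azure Cosmos DB. Names are placeholders.
import os
import uuid

import azure.functions as func
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import VisualFeatureTypes
from azure.cosmos import CosmosClient
from msrest.authentication import CognitiveServicesCredentials

vision_client = ComputerVisionClient(
    os.environ["VISION_ENDPOINT"],
    CognitiveServicesCredentials(os.environ["VISION_KEY"]),
)
cosmos_container = (
    CosmosClient.from_connection_string(os.environ["COSMOS_CONNECTION_STRING"])
    .get_database_client("images")        # placeholder database name
    .get_container_client("analysis")     # placeholder container, partition key assumed to be /id
)

def main(event: func.EventGridEvent) -> None:
    blob_url = event.get_json()["url"]     # URL of the newly uploaded blob

    # Step 3: analyze the image by URL.
    analysis = vision_client.analyze_image(
        blob_url,
        visual_features=[VisualFeatureTypes.tags, VisualFeatureTypes.description],
    )

    # Step 4: persist the analysis results along with the image metadata.
    cosmos_container.upsert_item({
        "id": str(uuid.uuid4()),
        "imageUrl": blob_url,
        "tags": [{"name": t.name, "confidence": t.confidence} for t in analysis.tags],
        "caption": analysis.description.captions[0].text if analysis.description.captions else None,
    })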

Components
Computer Vision API is part of the Cognitive Services suite and is used to
retrieve information about each image.
Azure Functions provides the back-end API for the web application. This
platform also provides event processing for uploaded images.
Azure Event Grid triggers an event when a new image is uploaded to blob
storage. The image is then processed with Azure Functions.
Azure Blob Storage stores all of the image files that are uploaded into the web
application, as well as any static files that the web application consumes.
Azure Cosmos DB stores metadata about each image that is uploaded, including
the results of the processing from Computer Vision API.

Alternatives
Custom Vision Service . The Computer Vision API returns a set of taxonomy-
based categories. If you need to process information that isn't returned by the
Computer Vision API, consider the Custom Vision Service, which lets you build
custom image classifiers.
Cognitive Search (formerly Azure Search). If your use case involves querying the
metadata to find images that meet specific criteria, consider using Cognitive
Search. Currently in preview, Cognitive Search seamlessly integrates into this
workflow.
Logic Apps. If you don't need to react in real time to files that are added to a blob container, you
might consider using Logic Apps. A logic app that checks whether a file was added
can be started by the recurrence trigger or the sliding window trigger.

Scenario details
This scenario is relevant for businesses that need to process images.

Potential applications include classifying images for a fashion website, analyzing text
and images for insurance claims, or understanding telemetry data from game
screenshots. Traditionally, companies would need to develop expertise in machine
learning models, train the models, and finally run the images through their custom
process to get the data out of the images.

Potential use cases


This solution is ideal for the retail, game, finance, and insurance industries. Other
relevant use cases include:

Classifying images on a fashion website. Image classification can be used by sellers while uploading pictures of products on the platform for sale. They can then automate the consequent manual tagging involved. The customers can also search through the visual impression of the products.

Classifying telemetry data from screenshots of games. The classification of video games from screenshots is evolving into a relevant problem in social media, coupled with computer vision. For example, when Twitch streamers play different games in succession, they might skip manually updating their stream information. Failure to update stream information could result in the misclassification of streams in user searches and might lead to the loss of potential viewership for both the content creators and the streaming platforms. While introducing novel games, a custom model route could be helpful to introduce the capability to detect novel images from those games.

Classifying images for insurance claims. Image classification can help reduce the
time and cost of claims processing and underwriting. It could help analyze natural-
disaster damage, vehicle-damage, and identify residential and commercial
properties.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Consider these points when implementing this solution:

Scalability
The majority of the components used in this example scenario are managed services that scale automatically. There are a couple of notable exceptions: Azure Functions has a limit of a maximum of 200 instances. If you need to scale beyond this limit, consider multiple regions or app plans.

You can provision Azure Cosmos DB to autoscale in Azure Cosmos DB for NoSQL only. If
you plan to use other APIs, see guidance on estimating your requirements in Request
units. To fully take advantage of the scaling in Azure Cosmos DB, understand how
partition keys work in Azure Cosmos DB.

NoSQL databases frequently trade consistency (in the sense of the CAP theorem) for
availability, scalability, and partitioning. In this example scenario, a key-value data model
is used and transaction consistency is rarely needed as most operations are by definition
atomic. Additional guidance to Choose the right data store is available in the Azure
Architecture Center. If your implementation requires high consistency, you can choose
your consistency level in Azure Cosmos DB.

For general guidance on designing scalable solutions, see the performance efficiency
checklist in the Azure Architecture Center.

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

Managed identities for Azure resources are used to provide access to other resources that are internal to your account, and they're assigned to your Azure Functions app. Allow those identities access only to the requisite resources, to ensure that nothing extra is exposed to your functions (and potentially to your customers).

For general guidance on designing secure solutions, see the Azure Security
Documentation.

Resiliency
All of the components in this scenario are managed, so at a regional level they are all
resilient automatically.

For general guidance on designing resilient applications, see Designing resilient applications for Azure.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

To explore the cost of running this scenario, all of the services are pre-configured in the
cost calculator. To see how the pricing would change for your particular use case,
change the appropriate variables to match your expected traffic.

We have provided three sample cost profiles based on the amount of traffic (we assume all images are 100 KB in size):

Small: this pricing example correlates to processing < 5,000 images a month.
Medium: this pricing example correlates to processing 500,000 images a month.
Large: this pricing example correlates to processing 50 million images a month.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

David Stanford | Principal Program Manager


Ashish Chauhan | Senior Solution Architect

Next steps
Product documentation

What is Computer Vision?


AI enrichment in Azure Cognitive Search
Introduction to Azure Functions
What is Azure Event Grid?
Introduction to Azure Blob storage
Welcome to Azure Cosmos DB

For a guided learning path, see:

Build a serverless web app in Azure


Classify images with the Custom Vision service
Use AI to recognize objects in images by using the Custom Vision service
Classify endangered bird species with Custom Vision
Classify images with the Microsoft Custom Vision Service
Detect objects in images with the Custom Vision service

Before deploying this example scenario in a production environment, review recommended practices for optimizing the performance and reliability of Azure Functions.

Related resources
AI enrichment with image and natural language processing in Azure Cognitive Search
Use a speech-to-text transcription pipeline to analyze recorded conversations
Azure AI Speech Azure AI Language Azure AI services Azure Synapse Analytics Azure Logic Apps

Speech recognition and analysis of recorded customer calls can provide your business
with valuable information about current trends, product shortcomings, and successes.

The example solution described in this article outlines a repeatable pipeline for
transcribing and analyzing conversation data.

Architecture
The architecture consists of two pipelines: A transcription pipeline to convert audio to
text, and an enrichment and visualization pipeline.

Transcription pipeline

Download a Visio file of this architecture.

Dataflow
1. Audio files are uploaded to an Azure Storage account via any supported method.
You can use a UI-based tool like Azure Storage Explorer or use a storage SDK or
API.
2. The upload to Azure Storage triggers an Azure logic app. The logic app accesses
any necessary credentials in Azure Key Vault and makes a request to the Speech
service's batch transcription API.
3. The logic app submits the audio file locations to the Speech service, including optional
settings for speaker diarization (see the request sketch after this list).
4. The Speech service completes the batch transcription and loads the transcription
results to the Storage account.
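
A minimal sketch of the kind of batch transcription request the logic app can make is
shown below, assuming the Speech to text REST API v3.1; the region, key, SAS URLs, and
property choices are placeholders rather than values defined by this architecture.

```python
# Minimal sketch of the batch transcription request the logic app makes,
# assuming the Speech to text REST API v3.1. The region, key, and the SAS URLs
# for the audio container and destination container are placeholders.
import requests

SPEECH_KEY = "<speech-resource-key>"
ENDPOINT = "https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"

body = {
    "displayName": "call-recordings-batch",
    "locale": "en-US",
    "contentContainerUrl": "https://<account>.blob.core.windows.net/audio?<sas-token>",
    "properties": {
        "diarizationEnabled": True,          # optional speaker diarization
        "wordLevelTimestampsEnabled": True,
        "destinationContainerUrl": "https://<account>.blob.core.windows.net/transcripts?<sas-token>",
    },
}

response = requests.post(
    ENDPOINT,
    json=body,
    headers={"Ocp-Apim-Subscription-Key": SPEECH_KEY},
)
response.raise_for_status()
print(response.json()["self"])  # URL to poll for the transcription status
```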

Enrichment and visualization pipeline

Download a Visio file of this architecture.

Dataflow

5. An Azure Synapse Analytics pipeline runs to retrieve and process the transcribed
audio text.
6. The pipeline sends processed text via an API call to the Language service. The
service performs various natural language processing (NLP) enrichments, like
sentiment and opinion mining, summarization, and custom and prebuilt named
entity recognition (see the sentiment analysis sketch after this list).
7. The processed data is stored in an Azure Synapse Analytics SQL pool, where it can
be served to visualization tools like Power BI.
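
The sketch below illustrates one of these enrichment calls, sentiment analysis with
opinion mining, using the azure-ai-textanalytics SDK; the endpoint, key, and sample
utterance are placeholders.

```python
# Minimal sketch of a Language service enrichment step, using the
# azure-ai-textanalytics SDK. The endpoint, key, and the sample utterance are
# placeholders for transcribed text retrieved by the Synapse pipeline.
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

client = TextAnalyticsClient(
    endpoint="https://<your-language-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<language-resource-key>"),
)

documents = ["The agent was helpful, but the hold time was far too long."]

# Sentiment analysis with opinion mining surfaces per-target assessments.
for result in client.analyze_sentiment(documents, show_opinion_mining=True):
    if result.is_error:
        continue
    print(result.sentiment, result.confidence_scores)
    for sentence in result.sentences:
        for opinion in sentence.mined_opinions:
            print(opinion.target.text, "->", [a.text for a in opinion.assessments])
```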

Components
Azure Blob Storage. Massively scalable and secure object storage for cloud-
native workloads, archives, data lakes, high-performance computing, and machine
learning. In this solution, it stores the audio files and transcription results and
serves as a data lake for downstream analytics.
Azure Logic Apps. An integration platform as a service (iPaaS) that's built on a
containerized runtime. In this solution, it integrates storage and speech AI services.
Azure Cognitive Services Speech service. An AI-based API that provides speech
capabilities like speech-to-text, text-to-speech, speech translation, and speaker
recognition. Its batch transcription functionality is used in this solution.
Azure Cognitive Service for Language. An AI-based managed service that
provides natural language capabilities like sentiment analysis, entity extraction, and
automated question answering.
Azure Synapse Analytics. A suite of services that provide data integration,
enterprise data warehousing, and big data analytics. In this solution, it transforms
and enriches transcription data and serves data to downstream visualization tools.
Power BI. A data modeling and visual analytics tool. In this solution, it presents
transcribed audio insights to users and decision makers.

Alternatives
Here are some alternative approaches to this solution architecture:

Consider configuring the Blob Storage account to use a hierarchical namespace.
This configuration provides ACL-based security controls and can improve
performance for some big data workloads.
You might be able to use Azure Functions as a code-first integration tool instead of
Logic Apps or Azure Synapse pipelines, depending on the size and scale of the
workload.

Scenario details
Customer care centers are an integral part of the success of many businesses in many
industries. This solution uses the Speech API from Azure Cognitive Services for the audio
transcription and diarization of recorded customer calls. Azure Synapse Analytics is used
to process and perform NLP tasks like sentiment analysis and custom named entity
recognition through API calls to Azure Cognitive Service for Language.

You can use the services and pipeline described here to process transcribed text to
recognize and remove sensitive information, perform sentiment analysis, and more. You
can scale the services and pipeline to accommodate any volume of recorded data.
Potential use cases
This solution can provide value to organizations in many industries, including
telecommunications, financial services, and government. It applies to any organization
that records conversations. In particular, customer-facing or internal call centers or
support desks can benefit from the insights derived from this solution.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that you can use to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

The request to the Speech API can include a shared access signature (SAS) URI for
a destination container in Azure Storage. A SAS URI enables the Speech service to
directly output the transcription files to the container location (a SAS generation
sketch appears after this list). If your organization doesn't allow the use of SAS URIs
for storage, you need to implement a function to periodically poll the Speech API
for completed assets.
Credentials like account or API keys should be stored in Azure Key Vault as secrets.
Configure your Logic Apps and Azure Synapse pipelines to access the key vault by
using managed identities to avoid storing secrets in application settings or code.
The audio files that are stored in the blob might contain sensitive customer data. If
multiple clients are using the solution, you need to restrict access to these files.
Use hierarchical namespace on the storage account and enforce folder and file
level permissions to limit access to only the needed Microsoft Entra instance.
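
As a hedged illustration of the first point above, the following sketch generates a
short-lived, container-scoped SAS token with the azure-storage-blob SDK; the account,
key, container name, permissions, and expiry are assumptions to adapt to your own
policy.

```python
# Minimal sketch: generate a short-lived, write-only SAS URI for the destination
# container referenced in the Speech API request. The account name, key, and
# container name are placeholders; consider a user delegation SAS if your policy
# prefers Microsoft Entra-based signing over account keys.
from datetime import datetime, timedelta, timezone
from azure.storage.blob import generate_container_sas, ContainerSasPermissions

sas_token = generate_container_sas(
    account_name="<storage-account>",
    container_name="transcripts",
    account_key="<storage-account-key>",
    permission=ContainerSasPermissions(write=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=2),
)

destination_container_url = (
    f"https://<storage-account>.blob.core.windows.net/transcripts?{sas_token}"
)
```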

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

All Azure services described in this architecture provide an option for pay-as-you-go
billing, so solution costs scale linearly.
Azure Synapse provides an option for serverless SQL pools, so the compute for the data
warehousing workload can be spun up on demand. If you aren't using Azure Synapse to
serve other downstream use cases, consider using serverless to reduce costs.

See Overview of the cost optimization pillar for more cost optimization strategies.

For pricing for the services suggested here, see this estimate in the Azure pricing
calculator .

Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands
placed on it by users in an efficient manner. For more information, see Performance
efficiency pillar overview.

The batch speech API is designed for high volume, but other Cognitive Services APIs
might have request limits for each subscription tier. Consider containerizing these APIs
to avoid throttling large-volume processing. Containers give you flexibility in
deployment, in the cloud or on-premises. You can also mitigate side effects of new
version rollouts by using containers. For more information, see Container support in
Azure Cognitive Services.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Dhanashri Kshirsagar | Senior Content Program Manager


Brady Leavitt | Dir Specialist GBB
Kirpa Singh | Senior Software Engineer
Christina Skarpathiotaki | Cloud Solution Architect

Other contributor:

Mick Alberts | Technical Writer

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Quickstart: Recognize and convert speech to text
Quickstart: Create an integration workflow with multi-tenant Azure Logic Apps and
the Azure portal
Quickstart: Get started with Language Studio
Cognitive Services in Azure Synapse Analytics
What is the Speech service?
What is Azure Logic Apps?
What is Azure Cognitive Service for Language?
What is Azure Synapse Analytics?
Extract insights from text with the Language service
Model, query, and explore data in Azure Synapse

Related resources
Natural language processing technology
Optimize marketing with machine learning
Big data analytics with enterprise-grade security using Azure Synapse
Extract and analyze call center data
Azure Blob Storage Azure AI Speech Azure AI services Power BI

This article describes how to extract insights from customer conversations at a call
center by using Azure AI services and Azure OpenAI Service. Use these real-time and
post-call analytics to improve call center efficiency and customer satisfaction.

Architecture

Architecture diagram: a person-to-person conversation between a caller and a call-center
agent is captured by a telephony server, and the audio files are uploaded to Azure Blob
Storage. Azure AI Speech (speech to text) transcribes the audio, Azure AI Language and
Azure OpenAI Service enrich the transcripts, and the extracted insights (detailed call
history, summaries, and reasons for calling) are surfaced in near real time through Power
BI, a web app, or a CRM.


Download a PowerPoint file of this architecture.

Dataflow
1. A phone call between an agent and a customer is recorded and stored in Azure
Blob Storage. Audio files are uploaded to an Azure Storage account via a
supported method, such as the UI-based tool, Azure Storage Explorer, or a Storage
SDK or API.

2. Azure AI Speech is used to transcribe audio files in batch mode asynchronously
with speaker diarization enabled. The transcription results are persisted in Blob
Storage.
3. Azure AI Language is used to detect and redact personal data in the transcript.

For batch mode transcription and personal data detection and redaction, use the
AI services Ingestion Client tool. The Ingestion Client tool uses a no-code approach
for call center transcription.

4. Azure OpenAI is used to process the transcript and extract entities, summarize the
conversation, and analyze sentiments (see the summarization sketch after this list).
The processed output is stored in Blob Storage and then analyzed and visualized by
using other services. You can also store the output in a datastore for keeping track
of metadata and for reporting. Use Azure OpenAI to process the stored
transcription information.

5. Power BI or a custom web application that's hosted by App Service is used to
visualize the output. Both options provide near real-time insights. You can store
this output in a CRM, so agents have contextual information about why the
customer called and can quickly solve potential problems. This process is fully
automated, which saves the agents time and effort.
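
A minimal sketch of the Azure OpenAI step is shown below, using the openai Python
package's AzureOpenAI client; the endpoint, API version, deployment name, and prompt
are assumptions, not values prescribed by this solution.

```python
# Minimal sketch of the Azure OpenAI enrichment step (step 4), using the openai
# Python package (v1+). The endpoint, API version, and the deployment name are
# placeholders for your own Azure OpenAI deployment.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com/",
    api_key="<azure-openai-key>",
    api_version="2024-02-01",
)

transcript = "<redacted call transcript text>"

response = client.chat.completions.create(
    model="<your-deployment-name>",  # name of your deployment, not the base model
    messages=[
        {"role": "system", "content": "You summarize call-center conversations."},
        {
            "role": "user",
            "content": "Summarize this call, list named entities, and rate the "
                       f"customer sentiment as positive/neutral/negative:\n{transcript}",
        },
    ],
)
print(response.choices[0].message.content)
```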

Components
Blob Storage is the object storage solution for raw files in this scenario. Blob
Storage supports libraries for languages like .NET, Node.js, and Python.
Applications can access files on Blob Storage via HTTP or HTTPS. Blob Storage has
hot, cool, and archive access tiers for storing large amounts of data, which
optimizes cost.

Azure OpenAI provides access to the Azure OpenAI language models, including
GPT-3, Codex, and the embeddings model series, for content generation,
summarization, semantic search, and natural language-to-code translation. You
can access the service through REST APIs, Python SDK, or the web-based interface
in the Azure OpenAI Studio .

Azure AI Speech is an AI-based API that provides speech capabilities like speech-
to-text, text-to-speech, speech translation, and speaker recognition. This
architecture uses the Azure AI Speech batch transcription functionality.

Azure AI Language consolidates the Azure natural-language processing services.
For information about prebuilt and customizable options, see Azure AI Language
available features.

Language Studio provides a UI for exploring and analyzing AI services for language
features. It provides options for building, tagging, training, and deploying custom
models.
Power BI is a software-as-a-service (SaaS) that provides visual and interactive
insights for business analytics. It provides transformation capabilities and connects
to other data sources.

Alternatives
Depending on your scenario, you can add the following workflows.

Perform conversation summarization by using the prebuilt model in Azure AI
Language.
Depending on the size and scale of your workload, you can use Azure Functions as
a code-first integration tool to perform text-processing steps, like text
summarization on extracted data.
Deploy and implement a custom speech-to-text solution.

Scenario details
This solution uses Azure AI Speech to convert audio into written text. Azure AI Language
redacts sensitive information in the conversation transcription. Azure OpenAI extracts
insights from customer conversation to improve call center efficiency and customer
satisfaction. Use this solution to process transcribed text, recognize and remove
sensitive information, and perform sentiment analysis. Scale the services and the
pipeline to accommodate any volume of recorded data.

Potential use cases


This solution provides value to organizations in industries like telecommunications and
financial services. It applies to any organization that records conversations. Customer-
facing or internal call centers or support desks benefit from using this solution.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Reliability
Reliability ensures your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.

Find the availability service level agreement (SLA) for each component in SLAs for
online services.
To design high-availability applications with Storage accounts, see the
configuration options.
To ensure resiliency of the compute services and datastores in this scenario, use
failure mode analysis for services like Azure Functions and Storage. For more
information, see the resiliency checklist for Azure services.
Back up and recover your Form Recognizer models.

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

Implement data protection, identity and access management, and network security
recommendations for Blob Storage, AI services, and Azure OpenAI.
Configure AI services virtual networks.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

The total cost of this solution depends on the pricing tier of your services. Factors that
can affect the price of each component are:

The number of documents that you process.


The number of concurrent requests that your application receives.
The size of the data that you store after processing.
Your deployment region.

For more information, see the following resources:

Azure OpenAI pricing


Blob Storage pricing
Azure AI Language pricing
Azure Machine Learning pricing

Use the Azure pricing calculator to estimate your solution cost.

Performance efficiency
Performance efficiency is the ability of your workload to meet the demands placed on it
by users in an efficient manner. For more information, see Overview of the performance
efficiency pillar.

Processing high volumes of data can expose performance bottlenecks. To ensure
performance efficiency, understand and plan for the scaling options to use with the AI
services autoscale feature.

The batch speech API is designed for high volumes, but other AI services APIs might
have request limits, depending on the subscription tier. Consider containerizing AI
services APIs to avoid slowing down large-volume processing. Containers provide
deployment flexibility in the cloud and on-premises. Mitigate side effects of new version
rollouts by using containers. For more information, see Container support in AI services.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Dixit Arora | Senior Customer Engineer, ISV DN CoE


Jyotsna Ravi | Principal Customer Engineer, ISV DN CoE

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
What is Azure AI Speech?
What is Azure OpenAI?
What is Azure Machine Learning?
Introduction to Blob Storage
What is Azure AI Language?
Introduction to Azure Data Lake Storage Gen2
What is Power BI?
Ingestion Client with AI services
Post-call transcription and analytics

Related resources
Use a speech-to-text transcription pipeline to analyze recorded conversations
Deploy a custom speech-to-text solution
Create custom language and acoustic models
Deploy a custom speech-to-text solution
Determine customer lifetime and churn with Azure AI services
Azure Data Lake Storage Azure Databricks Azure Machine Learning Azure Analysis Services

This scenario shows a solution for creating predictive models of customer lifetime
value and churn rate by using Azure AI technologies.

Architecture
Architecture diagram: Azure Data Factory (with Event Grid subscriptions) ingests data into
Azure storage technologies (Azure Data Lake Storage and Azure SQL Database). Azure
Databricks handles data processing and feature engineering, and MLflow tracks model
training. The machine learning registry and deployment layer, backed by Azure Machine
Learning and Azure Kubernetes Service, serves the model, and the serving phase surfaces
results through Azure Analysis Services and a Power BI dashboard.

Download a Visio file of this architecture.

Dataflow
1. Ingestion and orchestration: Ingest historical, transactional, and third-party data
for the customer from on-premises data sources. Use Azure Data Factory and store
the results in Azure Data Lake Storage.
2. Data processing: Use Azure Databricks to pick up and clean the raw data from the
Data Lake Storage. Store the data in the silver layer in Azure Data Lake Storage.

3. Feature engineering: With Azure Databricks, load data from the silver layer of Data
Lake Storage. Use PySpark to enrich the data. After preparation, use feature
engineering to provide a better representation of data. Feature engineering can
also improve the performance of the machine learning algorithm.

4. Model training: In model training, the silver layer data is the model training dataset.
You can use MLflow to manage machine learning experiments. MLflow keeps track
of all metrics you need to evaluate your machine learning experiment.

MLflow parameters store model-related parameters, such as training
hyperparameters. MLflow metrics store model performance metrics. (A minimal
MLflow tracking sketch appears after this dataflow.) The machine learning model
iteratively retrains by using Azure Data Factory pipelines. The model retraining
pipeline gets updated training data from Azure Data Lake Storage and retrains the
model. The model retraining pipeline kicks off under the following conditions:

When the accuracy of the current model in production drops below a
threshold tracked by MLflow.
When calendar triggers, based on customer-defined rules, are reached.
When data drift is detected.

5. Machine learning registry: An Azure Data Factory pipeline registers the best
machine learning model in the Azure Machine Learning Service according to the
metrics chosen. The machine learning model is deployed by using the Azure
Kubernetes Service .

6. Serving phase: In the serving phase, you can use reporting tools to work with your
model predictions. These tools include Power BI and Azure Analysis Services.
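
The sketch below shows the kind of MLflow tracking calls referred to in the model
training step; MLflow is typically preinstalled on Databricks ML runtimes, and the run
name, parameters, metrics, and tag used here are illustrative.

```python
# Minimal sketch of the MLflow tracking referenced in the model training step.
# On Azure Databricks, runs log to the workspace tracking server by default;
# the parameter and metric names here are illustrative.
import mlflow

with mlflow.start_run(run_name="churn-classifier"):
    mlflow.log_param("max_depth", 8)            # training hyperparameters
    mlflow.log_param("learning_rate", 0.1)

    # ... train the model and evaluate it on a hold-out set ...

    mlflow.log_metric("accuracy", 0.91)         # model performance metrics
    mlflow.log_metric("auc", 0.87)
    mlflow.set_tag("dataset_version", "silver-2023-06")
```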

Components
Azure Analysis Services provides enterprise-grade data models in the cloud.

Azure Data Factory provides a data integration and transformation layer that
works across your digital transformation initiatives.

Azure Databricks is a data analytics platform optimized for the Microsoft Azure
cloud services platform.

Azure Machine Learning includes a range of experiences to build, train, and
deploy machine learning models and foster team collaboration.
Azure SQL Database is a database engine that handles most management
functions without your involvement. Azure SQL Database enables you to focus on
the domain-specific database administration and optimization activities for your
business.

MLflow is an open-source platform for managing the end-to-end machine learning
life cycle.

Alternatives
Data Factory orchestrates the workflows for your data pipeline. If you want to load
data only one time or on demand, use tools like SQL Server bulk copy and AzCopy
to copy data into Azure Blob Storage . You can then load the data directly into
Azure Synapse Analytics using PolyBase.

Some business intelligence tools may not support Azure Analysis Services. The
curated data can instead be accessed directly from Azure SQL Database. Data is
stored using Azure Data Lake Storage and accessed using Azure Databricks
storage for data processing.

Scenario details
Customer lifetime value measures the net profit from a customer. This metric includes
profit from the customer's whole relationship with your company. Churn or churn rate
measures the number of individuals or items moving out of a group over a period.

This retail customer scenario classifies your customers based on marketing and
economic measures. This scenario also creates a customer segmentation based on
several metrics. It trains a multi-class classifier on new data. The resulting model scores
batches of new customer orders through a regularly scheduled Azure Databricks
notebook job.

This solution demonstrates how to interconnect the following Azure AI technologies:

Use Azure Data Lake and Azure Databricks to implement best practices for data
operations.
Use Azure Databricks to do exploratory data analysis.
Use MLflow to track machine learning experiments.
Batch score machine learning models on Azure Databricks.
Use Azure Machine Learning for model registration and deployment.
Use Azure Data Factory and Azure Databricks notebooks to orchestrate the MLOps
pipeline.
Potential use cases
This solution is ideal for the retail industry. It's helpful in the following use cases:

In marketing, to determine how much to spend to acquire a customer.
For product teams, to tailor products and services for their best customers.
For customer support, to decide how much to spend to service and keep a
customer.
For sales representatives, to decide what types of customers to spend the most
time trying to acquire.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Availability
Azure components offer availability through redundancy and as specified in service-level
agreements (SLAs):

For information about Data Factory pipelines, see SLA for Data Factory .
For information about Azure Databricks, see Azure Databricks .
Data Lake Storage offers availability through redundancy. See Azure Storage
redundancy.

Scalability
This scenario uses Azure Data Lake Storage to store data for machine learning models
and predictions. Azure Storage is scalable. It can store and serve many exabytes of data.
This amount of storage is available with throughput measured in gigabits per second
(Gbps). Processing runs at near-constant per-request latencies. Latencies are measured
at the service, account, and file levels.

This scenario uses Azure Databricks clusters, which enable autoscaling by default.
Autoscaling enables Databricks to dynamically reallocate resources at runtime. With
autoscaling, you don't need to provision a cluster to match a workload, which makes it
easier to achieve high cluster utilization.

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

Protect assets by using controls on network traffic originating in Azure, between on-
premises and Azure hosted resources, and traffic to and from Azure. For instance, Azure
self-hosted integration runtime securely moves data from on-premises data storage to
Azure.

Use Azure Key Vault and Databricks scoped secret to access data in Azure Data Lake
Storage.

Azure services are either deployed in a secure virtual network or accessed using the
Azure Private Link feature. If necessary, row-level security provides granular access to
individual users in Azure Analysis Services or SQL Database.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

Azure Databricks is a premium Apache Spark offering with an associated cost.

There are standard and premium Databricks pricing tiers. For this scenario, the standard
pricing tier is sufficient. If your application requires automatically scaling clusters to
handle larger workloads or interactive Databricks dashboards, you might need the
premium tier.

Costs related to this use case depend on the standard pricing for the following services
for your usage:

Azure Databricks pricing


Azure Data Factory pricing
Azure Data Lake Storage pricing
Azure Machine Learning pricing

To estimate the cost of Azure products and configurations, visit the Azure pricing
calculator .

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:

Giulia Gallo | Senior Cloud Solution Architect

Next steps
Azure Machine Learning
Introduction to Azure Data Lake Storage Gen2
Azure Databricks
Azure Data Factory

Related resources
Artificial intelligence
MLOps for Python models using Azure Machine Learning
Customer churn prediction using real-time analytics
Predict Length of Stay and Patient Flow
Batch scoring for deep learning models using Azure Machine Learning pipelines
Azure Logic Apps Azure Machine Learning Azure Role-based access control Azure Storage

This reference architecture shows how to apply neural-style transfer to a video, using
Azure Machine Learning. Style transfer is a deep learning technique that composes an
existing image in the style of another image. You can generalize this architecture for any
scenario that uses batch scoring with deep learning.

Architecture

Download a Visio file of this architecture.

Workflow
This architecture consists of the following components.

Compute
Azure Machine Learning uses pipelines to create reproducible and easy-to-manage
sequences of computation. It also offers a managed compute target (on which a
pipeline computation can run) called Azure Machine Learning Compute for training,
deploying, and scoring machine learning models.

Storage

Azure Blob Storage stores all the images (input images, style images, and output
images). Azure Machine Learning integrates with Blob Storage so that users don't have
to manually move data across compute platforms and blob storages. Blob Storage is
also cost-effective for the performance that this workload requires.

Trigger

Azure Logic Apps triggers the workflow. When the Logic App detects that a blob has
been added to the container, it triggers the Azure Machine Learning pipeline. Logic
Apps is a good fit for this reference architecture because it's an easy way to detect
changes to blob storage, with an easy process for changing the trigger.

Preprocess and postprocess the data

This reference architecture uses video footage of an orangutan in a tree.

1. Use FFmpeg to extract the audio file from the video footage, so that the audio
file can be stitched back into the output video later.
2. Use FFmpeg to break the video into individual frames. The frames are processed
independently, in parallel.
3. At this point, you can apply neural style transfer to each individual frame in
parallel.
4. After each frame has been processed, use FFmpeg to restitch the frames back
together.
5. Finally, reattach the audio file to the restitched footage. (A command sketch of
these steps follows.)
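
A command-line sketch of these FFmpeg steps, driven from Python, might look like the
following; the file names, frame rate, and codec choices are assumptions, and step 3
(style transfer on the extracted frames) runs in the Machine Learning pipeline.

```python
# Minimal sketch of the FFmpeg pre- and post-processing steps, driven from
# Python. File names and the 30 fps frame rate are placeholders; step 3 (style
# transfer on the extracted frames) happens in the Machine Learning pipeline.
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# 1. Extract the audio track so it can be reattached later (assumes AAC audio).
run(["ffmpeg", "-i", "input.mp4", "-vn", "-acodec", "copy", "audio.aac"])

# 2. Split the video into individual frames for parallel processing.
run(["ffmpeg", "-i", "input.mp4", "frames/%05d.jpg"])

# 4. Restitch the style-transferred frames into a video.
run(["ffmpeg", "-framerate", "30", "-i", "styled_frames/%05d.jpg",
     "-c:v", "libx264", "-pix_fmt", "yuv420p", "video_noaudio.mp4"])

# 5. Reattach the original audio to the restitched footage.
run(["ffmpeg", "-i", "video_noaudio.mp4", "-i", "audio.aac",
     "-c:v", "copy", "-c:a", "copy", "output.mp4"])
```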

Components
Azure Machine Learning
Azure Blob Storage
Azure Logic Apps

Solution details
This reference architecture is designed for workloads that are triggered by the presence
of new media in Azure storage.

Processing involves the following steps:

1. Upload a video file to Azure Blob Storage.


2. The video file triggers Azure Logic Apps to send a request to the Azure Machine
Learning pipeline published endpoint.
3. The pipeline processes the video, applies style transfer with MPI, and
postprocesses the video.
4. The output is saved back to Blob Storage once the pipeline is completed.

Potential use cases


A media organization has a video whose style they want to change to look like a specific
painting. The organization wants to apply this style to all frames of the video in a timely
manner and in an automated fashion. For more background about neural style transfer
algorithms, see Image Style Transfer Using Convolutional Neural Networks (PDF).

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands
placed on it by users in an efficient manner. For more information, see Performance
efficiency pillar overview.

GPU versus CPU


For deep learning workloads, GPUs generally out-perform CPUs by a considerable
amount, to the extent that a sizeable cluster of CPUs is typically needed to get
comparable performance. Although you can use only CPUs in this architecture, GPUs
provide a much better cost/performance profile. We recommend using the latest NCv3
series of GPU optimized VMs.

GPUs aren't enabled by default in all regions. Make sure to select a region with GPUs
enabled. In addition, subscriptions have a default quota of zero cores for GPU-optimized
VMs. You can raise this quota by opening a support request. Make sure that your
subscription has enough quota to run your workload.

Parallelize across VMs versus cores

When you run a style transfer process as a batch job, the jobs that run primarily on
GPUs need to be parallelized across VMs. Two approaches are possible: You can create a
larger cluster using VMs that have a single GPU, or create a smaller cluster using VMs
with many GPUs.

For this workload, these two options have comparable performance. Using fewer VMs
with more GPUs per VM can help to reduce data movement. However, the data volume
per job for this workload isn't large, so you won't observe much throttling by Blob
Storage.

MPI step

When creating the Azure Machine Learning pipeline, one of the steps used to perform
parallel computation is the message passing interface (MPI) step. The MPI step helps
split the data evenly across the available nodes. The MPI step doesn't execute until all
the requested nodes are ready. Should one node fail or get preempted (if it's a low-
priority virtual machine), the MPI step will have to be rerun.

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar. This section
contains considerations for building secure solutions.

Restrict access to Azure Blob Storage


In this reference architecture, Azure Blob Storage is the main storage component that
needs to be protected. The baseline deployment shown in the GitHub repo uses storage
account keys to access the blob storage. For further control and protection, consider
using a shared access signature (SAS) instead. This grants limited access to objects in
storage, without needing to hard code the account keys or save them in plaintext. This
approach is especially useful because account keys are visible in plaintext inside of Logic
App's designer interface. Using an SAS also helps to ensure that the storage account has
proper governance, and that access is granted only to the people intended to have it.
For scenarios with more sensitive data, make sure that all of your storage keys are
protected, because these keys grant full access to all input and output data from the
workload.

Data encryption and data movement


This reference architecture uses style transfer as an example of a batch scoring process.
For more data-sensitive scenarios, the data in storage should be encrypted at rest. Each
time data is moved from one location to the next, use Transport Layer Security (TLS) to
secure the data transfer. For more information, see Azure Storage security guide.

Secure your computation in a virtual network


When deploying your Machine Learning compute cluster, you can configure your cluster
to be provisioned inside a subnet of a virtual network. This subnet allows the compute
nodes in the cluster to communicate securely with other virtual machines.

Protect against malicious activity


In scenarios where there are multiple users, make sure that sensitive data is protected
against malicious activity. If other users are given access to this deployment to
customize the input data, note the following precautions and considerations:

Use Azure role-based access control (Azure RBAC) to limit users' access to only the
resources they need.
Provision two separate storage accounts. Store input and output data in the first
account. External users can be given access to this account. Store executable
scripts and output log files in the other account. External users should not have
access to this account. This separation ensures that external users can't modify any
executable files (to inject malicious code), and don't have access to log files, which
could hold sensitive information.
Malicious users can perform a DDoS attack on the job queue or inject
malformed poison messages in the job queue, causing the system to lock up or
causing dequeuing errors.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
Compared to the storage and scheduling components, the compute resources used in
this reference architecture by far dominate in terms of costs. One of the main challenges
is effectively parallelizing the work across a cluster of GPU-enabled machines.

The Azure Machine Learning Compute cluster size can automatically scale up and down
depending on the jobs in the queue. You can enable autoscale programmatically by
setting the minimum and maximum nodes.

For work that doesn't require immediate processing, configure autoscale so the default
state (minimum) is a cluster of zero nodes. With this configuration, the cluster starts with
zero nodes and only scales up when it detects jobs in the queue. If the batch scoring
process happens only a few times a day or less, this setting results in significant cost
savings.

Autoscaling may not be appropriate for batch jobs that happen too close to each other.
The time that it takes for a cluster to spin up and spin down also incur a cost, so if a
batch workload begins only a few minutes after the previous job ends, it might be more
cost effective to keep the cluster running between jobs.

Azure Machine Learning Compute also supports low-priority virtual machines, which
allows you to run your computation on discounted virtual machines, with the caveat that
they may be preempted at any time. Low-priority virtual machines are ideal for non-
critical batch scoring workloads.
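
As a sketch of these cost controls, the following example uses the Azure Machine
Learning Python SDK (azure-ai-ml, v2) to define a GPU cluster that scales from zero
nodes and uses low-priority VMs; the subscription details, cluster name, and VM size are
placeholders.

```python
# Minimal sketch, using the Azure Machine Learning Python SDK (azure-ai-ml, v2):
# a GPU cluster that scales from zero nodes and uses low-priority VMs.
# Subscription, resource group, workspace, and cluster names are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

gpu_cluster = AmlCompute(
    name="style-transfer-gpu",
    size="Standard_NC6s_v3",            # NCv3-series GPU VM
    min_instances=0,                     # scale to zero between batch jobs
    max_instances=4,
    idle_time_before_scale_down=300,     # seconds
    tier="low_priority",                 # accept preemption for lower cost
)

ml_client.compute.begin_create_or_update(gpu_cluster).result()
```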

Monitor batch jobs


While running your job, it's important to monitor the progress and make sure that the
job is working as expected. However, it can be a challenge to monitor across a cluster of
active nodes.

To check the overall state of the cluster, go to the Machine Learning service in the Azure
portal to check the state of the nodes in the cluster. If a node is inactive or a job has
failed, the error logs are saved to Blob Storage, and are also accessible in the Azure
portal.

Monitoring can be further enriched by connecting logs to Application Insights or by
running separate processes to poll for the state of the cluster and its jobs.

Log with Azure Machine Learning


Azure Machine Learning will automatically log all stdout/stderr to the associated Blob
Storage account. Unless otherwise specified, your Azure Machine Learning workspace
will automatically provision a storage account and dump your logs into it. You can also
use a storage navigation tool such as Azure Storage Explorer , which is an easier way
to navigate log files.

Deploy this scenario


To deploy this reference architecture, follow the steps described in the GitHub repo .

You can also deploy a batch scoring architecture for deep learning models by using the
Azure Kubernetes Service. Follow the steps described in this GitHub repo .

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Jian Tang | Program Manager II

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Batch scoring of Spark models on Azure Databricks
Batch scoring of Python models on Azure
Batch scoring with R models to forecast sales

Related resources
Artificial intelligence architecture
What is Azure Machine Learning?
Azure Machine Learning pipelines
Batch scoring of Python models on Azure
Azure Container Registry Azure Event Hubs Azure Machine Learning Azure SQL Database Azure Stream Analytics

This architecture guide shows how to build a scalable solution for batch scoring models
with Azure Machine Learning. The solution can be used as a template and can generalize
to different problems.

Architecture

Download a Visio file of this architecture.

Workflow
This architecture guide is applicable for both streaming and static data, provided that
the ingestion process is adapted to the data type. The following steps and components
describe the ingestion of these two types of data.

Streaming data:

1. Streaming data originates from IoT Sensors, where new events are streamed at
frequent intervals.
2. Incoming streaming events are queued using Azure Event Hubs, and then pre-
processed using Azure Stream Analytics.

Azure Event Hubs. This message ingestion service can ingest millions of event
messages per second. In this architecture, sensors send a stream of data to
the event hub.
Azure Stream Analytics. An event-processing engine. A Stream Analytics job
reads the data streams from the event hub and performs stream processing.

Static data:

3. Static datasets can be stored as files within Azure Data Lake Storage or in tabular
form in Azure Synapse or Azure SQL Database .
4. Azure Data Factory can be used to aggregate or pre-process the stored dataset.

The remaining architecture, after data ingestion, is equal for both streaming and static
data, and consists of the following steps and components:

5. The ingested, aggregated, and/or pre-processed data can be stored as documents
within Azure Data Lake Storage or in tabular form in Azure Synapse or Azure
SQL Database. This data will then be consumed by Azure Machine Learning.
6. Azure Machine Learning is used for training, deploying, and managing machine
learning models at scale. In the context of batch scoring, Azure Machine Learning
creates a cluster of virtual machines with an automatic scaling option, where jobs
are executed in parallel as Python scripts.
7. Models are deployed as managed batch endpoints, which are then used to do
batch inferencing on large volumes of data over a period of time (see the
invocation sketch after this list). Batch endpoints receive pointers to data and run
jobs asynchronously to process the data in parallel on compute clusters.
8. The inference results can be stored as documents within Azure Data Lake
Storage or in tabular form in Azure Synapse or Azure SQL Database .
9. Visualize: The stored model results can be consumed through user interfaces, such
as Power BI dashboards, or through custom-built web applications.
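
As a rough sketch of step 7, the following shows invoking a managed batch endpoint
with the azure-ai-ml SDK; the endpoint name and datastore path are placeholders, and
the invoke() parameter names should be verified against your installed SDK version,
since they have changed across releases.

```python
# Minimal sketch of invoking a managed batch endpoint (step 7) with the
# azure-ai-ml SDK. Subscription details, the endpoint name, and the data path
# are placeholders; check invoke() parameter names against your SDK version.
from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = ml_client.batch_endpoints.invoke(
    endpoint_name="scoring-endpoint",
    input=Input(
        type=AssetTypes.URI_FOLDER,
        path="azureml://datastores/workspaceblobstore/paths/scoring-data/",
    ),
)

ml_client.jobs.stream(job.name)  # wait for the asynchronous job and stream its logs
```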

Components
Azure Event Hubs
Azure Stream Analytics
Azure SQL Database
Azure Synapse Analytics
Azure Data Lake Storage
Azure Data Factory
Azure Machine Learning
Azure Machine Learning Endpoints
Microsoft Power BI on Azure
Azure Web Apps
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Performance
For standard Python models, it's generally accepted that CPUs are sufficient to handle
the workload. This architecture uses CPUs. However, for deep learning workloads, GPUs
generally outperform CPUs by a considerable amount; a sizeable cluster of CPUs is
usually needed to get comparable performance.

Parallelize across VMs versus cores


When you run scoring processes of many models in batch mode, the jobs need to be
parallelized across VMs. Two approaches are possible:

Create a larger cluster using low-cost VMs.
Create a smaller cluster using high-performing VMs with more cores available on
each.

In general, scoring of standard Python models isn't as demanding as scoring of deep
learning models, and a small cluster should be able to handle a large number of queued
models efficiently. You can increase the number of cluster nodes as the dataset sizes
increase.

For convenience in this scenario, one scoring task is submitted within a single Azure
Machine Learning pipeline step. However, it can be more efficient to score multiple data
chunks within the same pipeline step. In those cases, write custom code to read in
multiple datasets and execute the scoring script during a single-step execution.
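
A minimal sketch of that single-step, multi-chunk approach is shown below; the folder
layout, the joblib model file, and the column names are placeholders.

```python
# Minimal sketch of scoring several data chunks inside a single pipeline step.
# The input folder layout, the joblib model file, and the feature columns are
# placeholders; adapt them to how your pipeline mounts data and models.
import glob
import os

import joblib
import pandas as pd

model = joblib.load("models/scoring_model.joblib")

os.makedirs("outputs/predictions", exist_ok=True)
for path in glob.glob("inputs/scoring/*.csv"):        # each file is one data chunk
    chunk = pd.read_csv(path)
    chunk["prediction"] = model.predict(chunk.drop(columns=["id"]))
    out_path = os.path.join("outputs/predictions", os.path.basename(path))
    chunk.to_csv(out_path, index=False)
```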

Management
Monitor jobs. It's important to monitor the progress of running jobs. However, it
can be a challenge to monitor across a cluster of active nodes. To inspect the state
of the nodes in the cluster, use the Azure portal to manage the Machine
Learning workspace. If a node is inactive or a job has failed, the error logs are
saved to blob storage, and are also accessible in the Pipelines section. For richer
monitoring, connect logs to Application Insights, or run separate processes to poll
for the state of the cluster and its jobs.
Logging. Machine Learning logs all stdout/stderr to the associated Azure Storage
account. To easily view the log files, use a storage navigation tool such as Azure
Storage Explorer .

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

The most expensive components used in this architecture guide are the compute
resources. The compute cluster size scales up and down depending on the jobs in the
queue. Enable automatic scaling programmatically through the Python SDK by
modifying the compute's provisioning configuration. Or, use the Azure CLI to set the
automatic scaling parameters of the cluster.

For work that doesn't require immediate processing, configure the automatic scaling
formula so the default state (minimum) is a cluster of zero nodes. With this
configuration, the cluster starts with zero nodes and only scales up when it detects jobs
in the queue. If the batch scoring process happens only a few times a day or less, this
setting enables significant cost savings.

Automatic scaling might not be appropriate for batch jobs that occur too close to each
other. Because the time that it takes for a cluster to spin up and spin down incurs a cost,
if a batch workload begins only a few minutes after the previous job ends, it might be
more cost effective to keep the cluster running between jobs. This strategy depends on
whether scoring processes are scheduled to run at a high frequency (every hour, for
example), or less frequently (once a month, for example).

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Carlos Alexandre Santos | Senior Specialized AI Cloud Solution Architect


Said Bleik | Principal Applied Scientist Manager

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Product documentation:

What is Azure Blob Storage?


Introduction to private Docker container registries in Azure
Azure Event Hubs
What is Azure Machine Learning?
What is Azure SQL Database?
Welcome to Azure Stream Analytics

Microsoft Learn modules:

Deploy Azure SQL Database


Enable reliable messaging for Big Data applications using Azure Event Hubs
Explore Azure Event Hubs
Implement a Data Streaming Solution with Azure Streaming Analytics
Introduction to machine learning
Manage container images in Azure Container Registry

Related resources
Artificial intelligence (AI) - Architectural overview
Batch scoring for deep learning models using Azure Machine Learning pipelines
Batch scoring of Spark models on Azure Databricks
MLOps for Python models using Azure Machine Learning
Batch scoring with R models to forecast sales
Azure Batch Azure Blob Storage Azure Container Instances Azure Logic Apps Azure Machine Learning

This reference architecture shows how to perform batch scoring with R models using
Azure Batch. Azure Batch works well with intrinsically parallel workloads and includes job
scheduling and compute management. Batch inference (scoring) is widely used to
segment customers, forecast sales, predict customer behaviors, predict maintenance, or
improve cyber security.

Download a Visio file of this architecture.

Workflow
This architecture consists of the following components.

Azure Batch runs forecast generation jobs in parallel on a cluster of virtual machines.
Predictions are made using pre-trained machine learning models implemented in R.
Azure Batch can automatically scale the number of VMs based on the number of jobs
submitted to the cluster. On each node, an R script runs within a Docker container to
score data and generate forecasts.

Azure Blob Storage stores the input data, the pre-trained machine learning models, and
the forecast results. It delivers cost-effective storage for the performance that this
workload requires.
Azure Container Instances provides serverless compute on demand. In this case, a
container instance is deployed on a schedule to trigger the Batch jobs that generate the
forecasts. The Batch jobs are triggered from an R script using the doAzureParallel
package. The container instance automatically shuts down once the jobs have finished.

Azure Logic Apps triggers the entire workflow by deploying the container instances on a
schedule. An Azure Container Instances connector in Logic Apps allows an instance to
be deployed upon a range of trigger events.

Components
Azure Batch
Azure Blob Storage
Azure Container Instances
Azure Logic Apps

Solution details
Although the following scenario is based on retail store sales forecasting, its architecture
can be generalized for any scenario requiring the generation of predictions on a larger
scale using R models. A reference implementation for this architecture is available on
GitHub .

Potential use cases


A supermarket chain needs to forecast sales of products over the upcoming quarter. The
forecast allows the company to manage its supply chain better and ensure it can meet
demand for products at each of its stores. The company updates its forecasts every
week as new sales data from the previous week becomes available and the product
marketing strategy for next quarter is set. Quantile forecasts are generated to estimate
the uncertainty of the individual sales forecasts.

Processing involves the following steps:

1. An Azure Logic App triggers the forecast generation process once per week.

2. The logic app starts an Azure Container Instance running the scheduler Docker
container, which triggers the scoring jobs on the Batch cluster.

3. Scoring jobs run in parallel across the nodes of the Batch cluster. Each node:

a. Pulls the worker Docker image and starts a container.


b. Reads input data and pre-trained R models from Azure Blob storage.

c. Scores the data to produce forecasts.

d. Writes forecast results to blob storage.

The following figure shows the forecasted sales for four products (SKUs) in one store.
The black line is the sales history, the dashed line is the median (q50) forecast, the pink
band represents the 25th and 75th percentiles, and the blue band represents the 5th
and 95th percentiles.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Performance

Containerized deployment
With this architecture, all R scripts run within Docker containers. Using containers
ensures that the scripts run in a consistent environment every time, with the same R
version and packages versions. Separate Docker images are used for the scheduler and
worker containers, because each has a different set of R package dependencies.

Azure Container Instances provides a serverless environment to run the scheduler
container. The scheduler container runs an R script that triggers the individual scoring
jobs running on an Azure Batch cluster.

Each node of the Batch cluster runs the worker container, which executes the scoring
script.

Parallelize the workload


When batch scoring data with R models, consider how to parallelize the workload. The
input data must be partitioned so that the scoring operation can be distributed across
the cluster nodes. Try different approaches to discover the best choice for distributing
your workload. On a case-by-case basis, consider:

How much data can be loaded and processed in the memory of a single node.
The overhead of starting each batch job.
The overhead of loading the R models.

In the scenario used for this example, the model objects are large, and it takes only a
few seconds to generate a forecast for individual products. For this reason, you can
group the products and execute a single Batch job per node. A loop within each job
generates forecasts for the products sequentially. This method is the most efficient way
to parallelize this particular workload. It avoids the overhead of starting many smaller
Batch jobs and repeatedly loading the R models.

An alternative approach is to trigger one Batch job per product. Azure Batch
automatically forms a queue of jobs and submits them to be executed on the cluster as
nodes become available. Use automatic scaling to adjust the number of nodes in the
cluster, depending on the number of jobs. This approach is useful if it takes a relatively
long time to complete each scoring operation, which justifies the overhead of starting
the jobs and reloading the model objects. This approach is also simpler to implement
and gives you the flexibility to use automatic scaling, an important consideration if the
size of the total workload isn't known in advance.
Monitor Azure Batch jobs
Monitor and terminate Batch jobs from the Jobs pane of the Batch account in the Azure
portal. Monitor the batch cluster, including the state of individual nodes, from the Pools
pane.

Log with doAzureParallel


The doAzureParallel package automatically collects logs of all stdout/stderr for every job
submitted on Azure Batch. These logs can be found in the storage account created at
setup. To view them, use a storage navigation tool such as Azure Storage Explorer or
Azure portal.

To quickly debug Batch jobs during development, view the logs in your local R session.
For more information, see Configure and submit training runs.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

The compute resources used in this reference architecture are the most costly
components. For this scenario, a cluster of fixed size is created whenever the job is
triggered and then shut down after the job has completed. Cost is incurred only while
the cluster nodes are starting, running, or shutting down. This approach is suitable for a
scenario where the compute resources required to generate the forecasts remain
relatively constant from job to job.

In scenarios where the amount of compute required to complete the job isn't known in
advance, it may be more suitable to use automatic scaling. With this approach, the size
of the cluster is scaled up or down depending on the size of the job. Azure Batch
supports a range of autoscale formulae, which you can set when defining the cluster
using the doAzureParallel API.

For some scenarios, the time between jobs may be too short to shut down and start up
the cluster. In these cases, keep the cluster running between jobs if appropriate.

Azure Batch and doAzureParallel support the use of low-priority VMs. These VMs come
with a significant discount but risk being appropriated by other higher priority
workloads. Therefore, the use of low-priority VMs isn't recommended for critical
production workloads. However, they're useful for experimental or development
workloads.

Deploy this scenario


To deploy this reference architecture, follow the steps described in the GitHub repo.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Angus Taylor | Senior Data Scientist

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
What is Azure Machine Learning?
Azure Machine Learning pipelines

Related resources
Artificial intelligence architecture
Batch scoring of Spark models on Azure Databricks
Batch scoring of Python models on Azure
Batch scoring for deep learning models
Batch scoring of Spark models on Azure Databricks
Azure Active Directory Azure Databricks Azure Data Factory Azure Blob Storage

This reference architecture shows how to build a scalable solution for batch scoring an
Apache Spark classification model on a schedule using Azure Databricks. Azure
Databricks is an Apache Spark-based analytics platform optimized for Azure. Azure
Databricks offers three environments for developing data intensive applications:
Databricks SQL, Databricks Data Science & Engineering, and Databricks Machine
Learning. Databricks Machine Learning is an integrated end-to-end machine learning
environment incorporating managed services for experiment tracking, model training,
feature development and management, and feature and model serving. You can use this
reference architecture as a template that can be generalized to other scenarios. A
reference implementation for this architecture is available on GitHub .

Apache® and Apache Spark® are either registered trademarks or trademarks of the
Apache Software Foundation in the United States and/or other countries. No endorsement
by The Apache Software Foundation is implied by the use of these marks.

Architecture
Architecture diagram: within Azure Databricks, machine data is ingested, feature
engineering produces training and scoring data sets from the Databricks data store, a
Spark MLlib model is trained, and a batch jobs scheduler runs scoring, writing results
data back to the data store.

Download a Visio file of this architecture.

Workflow
The architecture defines a data flow that is entirely contained within Azure Databricks
based on a set of sequentially executed notebooks . It consists of the following
components:

Data files. The reference implementation uses a simulated data set contained in five
static data files.

Ingestion. The data ingestion notebook downloads the input data files into a collection
of Databricks data sets. In a real-world scenario, data from IoT devices would stream
onto Databricks-accessible storage such as Azure SQL or Azure Blob storage. Databricks
supports multiple data sources .

Training pipeline. This notebook executes the feature engineering notebook to create
an analysis data set from the ingested data. It then executes a model building notebook
that trains the machine learning model using the Apache Spark MLlib scalable
machine learning library.
Scoring pipeline. This notebook executes the feature engineering notebook to create
scoring data set from the ingested data and executes the scoring notebook. The scoring
notebook uses the trained Spark MLlib model to generate predictions for the
observations in the scoring data set. The predictions are stored in the results store, a
new data set on the Databricks data store.

Scheduler. A scheduled Databricks job handles batch scoring with the Spark model.
The job executes the scoring pipeline notebook, passing variable arguments through
notebook parameters to specify the details for constructing the scoring data set and
where to store the results data set.
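
To make the scheduler step concrete, the following is a minimal sketch of how a scheduled job's driver notebook might invoke the scoring pipeline with notebook parameters. It assumes the Databricks dbutils utilities that are available inside a notebook; the notebook path and parameter names are hypothetical, not the reference implementation's actual names.

# Minimal sketch (runs inside a Databricks notebook, where dbutils is predefined).
# The notebook path and parameter names below are hypothetical.
result = dbutils.notebook.run(
    "/Repos/batch-scoring/3a_model_scoring",   # hypothetical scoring pipeline notebook
    3600,                                      # timeout in seconds
    {
        "start_date": "2015-11-15",            # date range that defines the scoring data set
        "to_date": "2017-01-01",
        "results_data": "scored_results",      # where to store the results data set
    },
)
print(f"Scoring pipeline finished with: {result}")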

Solution details
The scenario is constructed as a pipeline flow. Each notebook is optimized to perform in
a batch setting for each of the operations: ingestion, feature engineering, model
building, and model scoring. The feature engineering notebook is designed to
generate a general data set for any of the training, calibration, testing, or scoring
operations. In this scenario, we use a temporal split strategy for these operations, so the
notebook parameters are used to set date-range filtering.
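
As an illustration of that parameterization, here is a minimal sketch of how the feature engineering notebook might read date-range parameters through Databricks widgets and filter the ingested data. The widget, table, and column names are assumptions, not the reference implementation's actual names.

# Minimal sketch, assuming this runs in a Databricks notebook where
# dbutils and spark are predefined. Widget, table, and column names are hypothetical.
from pyspark.sql import functions as F

dbutils.widgets.text("start_date", "2015-11-15")
dbutils.widgets.text("to_date", "2017-01-01")

start_date = dbutils.widgets.get("start_date")
to_date = dbutils.widgets.get("to_date")

# Filter the ingested machine data to the requested date range.
machine_data = spark.table("machine_telemetry")
windowed = machine_data.filter(
    (F.col("datetime") >= F.lit(start_date)) & (F.col("datetime") < F.lit(to_date))
)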

Because the scenario creates a batch pipeline, we provide a set of optional examination
notebooks to explore the output of the pipeline notebooks. You can find these
notebooks in the GitHub repository notebooks folder :

1a_raw-data_exploring.ipynb
2a_feature_exploration.ipynb

2b_model_testing.ipynb

3b_model_scoring_evaluation.ipynb

Potential use cases


A business in an asset-heavy industry wants to minimize the costs and downtime
associated with unexpected mechanical failures. Using IoT data collected from their
machines, they can create a predictive maintenance model. This model enables the
business to maintain components proactively and repair them before they fail. By
maximizing mechanical component use, they can control costs and reduce downtime.

A predictive maintenance model collects data from the machines and retains historical
examples of component failures. The model can then be used to monitor the current
state of the components and predict if a given component will fail soon. For common
use cases and modeling approaches, see Azure AI guide for predictive maintenance
solutions.

This reference architecture is designed for workloads that are triggered by the presence
of new data from the component machines. Processing involves the following steps:

1. Ingest the data from the external data store onto an Azure Databricks data store.

2. Train a machine learning model by transforming the data into a training data set,
then building a Spark MLlib model. MLlib consists of the most common machine
learning algorithms and utilities optimized to take advantage of Spark data
scalability capabilities. (A minimal training and scoring sketch follows this list.)

3. Apply the trained model to predict (classify) component failures by transforming
the data into a scoring data set. Score the data with the Spark MLlib model.

4. Store results on the Databricks data store for post-processing consumption.

Notebooks are provided on GitHub to perform each of these tasks.
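
For orientation, the following is a minimal sketch of steps 2 and 3, training a Spark MLlib classifier and applying it to a scoring data set. The feature, label, and table names are assumptions; the reference implementation's notebooks define the actual columns and model.

# Minimal sketch of training and scoring with Spark MLlib.
# training_df and scoring_df are assumed to be prepared Spark DataFrames;
# feature and label column names are hypothetical.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

feature_cols = ["volt", "rotate", "pressure", "vibration"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
classifier = RandomForestClassifier(labelCol="failure", featuresCol="features")

model = Pipeline(stages=[assembler, classifier]).fit(training_df)

# Generate failure predictions for the scoring data set and persist the results.
predictions = model.transform(scoring_df)
predictions.select("machineID", "datetime", "prediction") \
    .write.mode("overwrite").saveAsTable("results_data")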

Recommendations
Databricks is set up so you can load and deploy your trained models to make
predictions with new data. Databricks also provides other advantages:

Single sign-on support using Microsoft Entra credentials.


Job scheduler to execute jobs for production pipelines.
Fully interactive notebook with collaboration, dashboards, REST APIs.
Unlimited clusters that can scale to any size.
Advanced security, role-based access controls, and audit logs.

To interact with the Azure Databricks service, use the Databricks Workspace interface
in a web browser or the command-line interface (CLI). Access the Databricks CLI from
any platform that supports Python 2.7.9 to 3.6.

The reference implementation uses notebooks to execute tasks in sequence. Each
notebook stores intermediate data artifacts (training, test, scoring, or results data sets)
to the same data store as the input data. The goal is to make it easy for you to use it as
needed in your particular use case. In practice, you would connect your data source to
your Azure Databricks instance for the notebooks to read and write directly back into
your storage.

Monitor job execution through the Databricks user interface, the data store, or the
Databricks CLI as necessary. Monitor the cluster using the event log and other
metrics that Databricks provides.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Performance
An Azure Databricks cluster enables autoscaling by default so that during runtime,
Databricks dynamically reallocates workers to account for the characteristics of your job.
Certain parts of your pipeline may be more computationally demanding than others.
Databricks adds extra workers during these phases of your job (and removes them when
they're no longer needed). Autoscaling makes it easier to achieve high cluster utilization,
because you don't need to provision the cluster to match a workload.

Develop more complex scheduled pipelines by using Azure Data Factory with Azure
Databricks.

Storage
In this reference implementation, the data is stored directly within Databricks storage for
simplicity. In a production setting, however, you can store the data on cloud data
storage such as Azure Blob Storage . Databricks also supports Azure Data Lake
Store , Azure Synapse Analytics , Azure Cosmos DB , Apache Kafka , and Apache
Hadoop .

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

In general, use the Azure pricing calculator to estimate costs. Other considerations are
described in the Cost section in Microsoft Azure Well-Architected Framework.

Azure Databricks is a premium Spark offering with an associated cost. In addition, there
are standard and premium Databricks pricing tiers .
For this scenario, the standard pricing tier is sufficient. However, if your specific
application requires automatically scaling clusters to handle larger workloads or
interactive Databricks dashboards, the premium level could increase costs further.

The solution notebooks can run on any Spark-based platform with minimal edits to
remove the Databricks-specific packages. See the following similar solutions for various
Azure platforms:

SQL Server R services


PySpark on an Azure Data Science Virtual Machine

Deploy this scenario


To deploy this reference architecture, follow the steps described in the GitHub
repository to build a scalable solution for scoring Spark models in batch on Azure
Databricks.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

John Ehrlinger | Senior Applied Scientist

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Perform data science with Azure Databricks
Deploy batch inference pipelines with Azure Machine Learning
Tutorial: Build an Azure Machine Learning pipeline for batch scoring

Related resources
Build a Real-time Recommendation API on Azure
Batch scoring for deep learning models using Azure Machine Learning pipelines
Batch scoring of Python Models on Azure
Build an enterprise-grade
conversational bot
Azure AI Bot Service Azure AI services

This reference architecture describes how to build an enterprise-grade conversational
bot (chatbot) using the Azure Bot Framework.

Architecture
(Architecture diagram: a user request (1) enters through the Bot Service input channels with authentication, which authenticates and authorizes against Azure AD (2) and retrieves tokens, keys, and secrets from Key Vault (3). The bot logic sends queries (4) to the cognition and intelligence services, a custom LUIS model for intents and entities, a QnA Maker knowledge base for FAQs, an Azure Search index, and a custom web app for unstructured Q&A, and uses the results (5) to build a response (7) for the user. A data ETL layer of Azure Functions custom serverless compute, Data Factory scheduled ETL pipelines, and Logic Apps data connectors loads structured raw data (CRM, SQL, and so on) and unstructured data (FAQs, PDFs, DOCs, and so on) to improve bot intelligence (6). Conversations, feedback, and logs flow through a bot logging utility to Application Insights, Azure Cosmos DB, and Azure Storage, and feed monitoring and reporting dashboards in Application Insights, Power BI, and a custom web app, with SendGrid sending emails and alerts. Azure DevOps, VS Code, and a bot testing utility support end-to-end testing, retraining, debugging, and feedback analysis.)
Download a Visio file of this architecture.

Workflow
The architecture shown here uses the following Azure services. Your own bot may not
use all of these services, or may incorporate additional services.

Bot logic and user experience


Bot Framework Service (BFS). This service connects your bot to a communication
app such as Cortana, Facebook Messenger, or Slack. It facilitates communication
between your bot and the user.
Azure App Service. The bot application logic is hosted in Azure App Service.

Bot cognition and intelligence


Language Understanding (LUIS). Part of Azure Cognitive Services, LUIS enables
your bot to understand natural language by identifying user intents and entities.
Azure Search. Search is a managed service that provides a quick searchable
document index.
QnA Maker. QnA Maker is a cloud-based API service that creates a conversational,
question-and-answer layer over your data. Typically, it's loaded with semi-structured
content such as FAQs. Use it to create a knowledge base for answering
natural-language questions.
Web app. If your bot needs AI solutions not provided by an existing service, you
can implement your own custom AI and host it as a web app. This provides a web
endpoint for your bot to call.

Data ingestion
The bot will rely on raw data that must be ingested and prepared. Consider any of the
following options to orchestrate this process:

Azure Data Factory. Data Factory orchestrates and automates data movement and
data transformation.
Logic Apps. Logic Apps is a serverless platform for building workflows that
integrate applications, data, and services. Logic Apps provides data connectors for
many applications, including Office 365.
Azure Functions. You can use Azure Functions to write custom serverless code that
is invoked by a trigger, for example, whenever a document is added to blob
storage or Azure Cosmos DB (see the sketch after this list).
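
As a sketch of the Azure Functions option, the following function (Python programming model v2) runs whenever a blob lands in a container. The container name and connection setting are assumptions, and the body only logs the event; a real ingestion function would transform and load the document.

# Minimal sketch of a blob-triggered Azure Function (Python v2 programming model).
# The container name "documents" and the AzureWebJobsStorage connection are assumptions.
import logging
import azure.functions as func

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob",
                  path="documents/{name}",
                  connection="AzureWebJobsStorage")
def ingest_document(blob: func.InputStream):
    content = blob.read()
    # A real implementation would parse, enrich, and load the document here.
    logging.info("Ingested %s (%d bytes)", blob.name, len(content))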

Logging and monitoring

Application Insights. Use Application Insights to log the bot's application metrics
for monitoring, diagnostic, and analytical purposes.
Azure Blob Storage. Blob storage is optimized for storing massive amounts of
unstructured data.
Azure Cosmos DB. Azure Cosmos DB is well-suited for storing semi-structured log
data such as conversations.
Power BI. Use Power BI to create monitoring dashboards for your bot.
Security and governance
Microsoft Entra ID. Users will authenticate through an identity provider such as
Microsoft Entra ID. The Bot Service handles the authentication flow and OAuth
token management. See Add authentication to your bot via Azure Bot Service.
Azure Key Vault. Store credentials and other secrets using Key Vault.

Quality assurance and enhancements


Azure DevOps . Provides many services for app management, including source
control, building, testing, deployment, and project tracking.
VS Code . A lightweight code editor for app development. You can use any other
IDE with similar features.

Components
Bot Framework Service
Azure App Service
Azure Cognitive Services
Azure Search
Azure Data Factory
Azure Logic Apps
Azure Functions
Application Insights is a feature of Azure Monitor
Azure Blob Storage
Azure Cosmos DB
Microsoft Entra ID
Azure Key Vault

Scenario details
Each bot is different, but there are some common patterns, workflows, and technologies
to be aware of. Especially for a bot to serve enterprise workloads, there are many design
considerations beyond just the core functionality.

The best practice utility samples used in this architecture are fully open-sourced and
available on GitHub .

Potential use cases


This solution is ideal for the telecommunications industry. This article covers the most
essential design aspects, and introduces the tools needed to build a robust, secure, and
actively learning bot.

Recommendations
At a high level, a conversational bot can be divided into the bot functionality (the
"brain") and a set of surrounding requirements (the "body"). The brain includes the
domain-aware components, including the bot logic and ML capabilities. Other
components are domain agnostic and address non-functional requirements such as
CI/CD, quality assurance, and security.

Before getting into the specifics of this architecture, let's start with the data flow
through each subcomponent of the design. The data flow includes user-initiated and
system-initiated data flows.

User message flow


Authentication. Users start by authenticating themselves using whatever mechanism is
provided by their channel of communication with the bot. The bot framework supports
many communication channels, including Cortana, Microsoft Teams, Facebook
Messenger, Kik, and Slack. For a list of channels, see Connect a bot to channels. When
you create a bot with Azure Bot Service, the Web Chat channel is automatically
configured. This channel allows users to interact with your bot directly in a web page.
You can also connect the bot to a custom app by using the Direct Line channel. The
user's identity is used to provide role-based access control, as well as to serve
personalized content.

User message. Once authenticated, the user sends a message to the bot. The bot reads
the message and routes it to a natural language understanding service such as LUIS.
This step gets the intents (what the user wants to do) and entities (what things the user
is interested in). The bot then builds a query that it passes to a service that serves
information, such as Azure Search for document retrieval, QnA Maker for FAQs, or a
custom knowledge base. The bot uses these results to construct a response. To give the
best result for a given query, the bot might make several back-and-forth calls to these
remote services.

Response. At this point, the bot has determined the best response and sends it to the
user. If the confidence score of the best-matched answer is low, the response might be a
disambiguation question or an acknowledgment that the bot could not reply
adequately.

Logging. When a user request is received or a response is sent, all conversation actions
should be logged to a logging store, along with performance metrics and general errors
from external services. These logs will be useful later when diagnosing issues and
improving the system.

Feedback. Another good practice is to collect user feedback and satisfaction scores. As a
follow-up to the bot's final response, the bot should ask the user to rate their
satisfaction with the reply. Feedback can help you to solve the cold start problem of
natural language understanding, and continually improve the accuracy of responses.

System data flow


ETL. The bot relies on information and knowledge extracted from the raw data by an ETL
process in the backend. This data might be structured (SQL database), semi-structured
(CRM system, FAQs), or unstructured (Word documents, PDFs, web logs). An ETL
subsystem extracts the data on a fixed schedule. The content is transformed and
enriched, then loaded into an intermediary data store, such as Azure Cosmos DB or
Azure Blob Storage.

Data in the intermediary store is then indexed into Azure Search for document retrieval,
loaded into QnA Maker to create question and answer pairs, or loaded into a custom
web app for unstructured text processing. The data is also used to train a LUIS model for
intent and entity extraction.

Quality assurance. The conversation logs are used to diagnose and fix bugs, provide
insight into how the bot is being used, and track overall performance. Feedback data is
useful for retraining the AI models to improve bot performance.

Building a bot
Before you even write a single line of code, it's important to write a functional
specification so the development team has a clear idea of what the bot is expected to
do. The specification should include a reasonably comprehensive list of user inputs and
expected bot responses in various knowledge domains. This living document will be an
invaluable guide for developing and testing your bot.

Ingest data

Next, identify the data sources that will enable the bot to interact intelligently with users.
As mentioned earlier, these data sources could contain structured, semi-structured, or
unstructured data sets. When you're getting started, a good approach is to make a one-
off copy of the data to a central store, such as Azure Cosmos DB or Azure Storage. As
you progress, you should create an automated data ingestion pipeline to keep this data
current. Options for an automated ingestion pipeline include Data Factory, Functions,
and Logic Apps. Depending on the data stores and the schemas, you might use a
combination of these approaches.

As you get started, it's reasonable to use the Azure portal to manually create Azure
resources. Later on, you should put more thought into automating the deployment of
these resources.

Core bot logic and UX

Once you have a specification and some data, it's time to start making your bot a
reality. Let's focus on the core bot logic. This is the code that handles the conversation
with the user, including the routing logic, disambiguation logic, and logging. Start by
familiarizing yourself with the Bot Framework , including:

Basic concepts and terminology used in the framework, especially conversations,
turns, and activities.
The Bot Connector service, which handles the networking between the bot and
your channels.
How conversation state is maintained, either in memory or, better yet, in a store
such as Azure Blob Storage or Azure Cosmos DB (a minimal sketch follows this list).
Middleware, and how it can be used to hook up your bot with external services,
such as Cognitive Services.
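
To make the state-management point concrete, here is a minimal sketch that uses the Bot Framework SDK for Python. MemoryStorage is suitable only for local development; in production you would swap in the Blob Storage or Cosmos DB storage providers. The property name and stored fields are assumptions.

# Minimal sketch of conversation state with the Bot Framework SDK for Python.
# MemoryStorage is for local development only; use a durable store in production.
from botbuilder.core import ConversationState, MemoryStorage, TurnContext

storage = MemoryStorage()
conversation_state = ConversationState(storage)
conversation_data = conversation_state.create_property("ConversationData")

async def on_turn(turn_context: TurnContext):
    data = await conversation_data.get(turn_context, dict)   # hypothetical shape
    data["last_utterance"] = turn_context.activity.text
    await conversation_state.save_changes(turn_context)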
For a rich user experience, there are many options.

You can use cards to include buttons, images, carousels, and menus.
A bot can support speech.
You can even embed your bot in an app or website and use the capabilities of the
app hosting it.

To get started, you can build your bot online using the Azure Bot Service, selecting from
the available C# and Node.js templates. As your bot gets more sophisticated, however,
you will need to create your bot locally then deploy it to the web. Choose an IDE, such
as Visual Studio or Visual Studio Code, and a programming language. SDKs are available
for the following languages:

C#
JavaScript
Java (preview)
Python (preview)

As a starting point, you can download the source code for the bot you created using the
Azure Bot Service. You can also find sample code , from simple echo bots to more
sophisticated bots that integrate with various AI services.

Add smarts to your bot


For a simple bot with a well-defined list of commands, you might be able to use a rules-
based approach to parse the user input via regex. This has the advantage of being
deterministic and understandable. However, when your bot needs to understand the
intents and entities of a more natural-language message, there are AI services that can
help.
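
Before looking at those services, note that the rules-based approach mentioned above can be as simple as a table of regular expressions. The following sketch maps hypothetical commands to intents and extracts entities from named groups.

# Minimal sketch of rules-based intent parsing with regular expressions.
# The commands and intent names are hypothetical.
import re

COMMAND_PATTERNS = {
    "check_order_status": re.compile(r"\b(status|track)\s+order\s+(?P<order_id>\d+)\b", re.I),
    "opening_hours": re.compile(r"\bopening\s+hours\b", re.I),
    "talk_to_agent": re.compile(r"\b(agent|human|representative)\b", re.I),
}

def parse_command(message: str):
    """Return (intent, entities) for a recognized command, or (None, {})."""
    for intent, pattern in COMMAND_PATTERNS.items():
        match = pattern.search(message)
        if match:
            return intent, match.groupdict()
    return None, {}

# Example: parse_command("track order 42") -> ("check_order_status", {"order_id": "42"})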

LUIS is specifically designed to understand user intents and entities. You train it
with a moderately sized collection of relevant user input and desired responses,
and it returns the intents and entities for a user's given message.

Azure Search can work alongside LUIS. Using Search, you create searchable indexes
over all relevant data. The bot queries these indexes for the entities extracted by
LUIS. Azure Search also supports synonyms, which can widen the net of correct
word mappings.

QnA Maker is another service that is designed to return answers for given
questions. It's typically trained over semi-structured data such as FAQs.
Your bot can use other AI services to further enrich the user experience. The Cognitive
Services suite of pre-built AI services (which includes LUIS and QnA Maker) has
services for vision, speech, language, search, and location. You can quickly add
functionality such as language translation, spell checking, sentiment analysis, OCR,
location awareness, and content moderation. These services can be wired up as
middleware modules in your bot to interact more naturally and intelligently with the
user.

Another option is to integrate your own custom AI service. This approach is more
complex, but gives you complete flexibility in terms of the machine learning algorithm,
training, and model. For example, you could implement your own topic modeling and
use an algorithm such as LDA to find similar or relevant documents. A good approach is
to expose your custom AI solution as a web service endpoint, and call the endpoint from
the core bot logic. The web service could be hosted in App Service or in a cluster of
VMs. Azure Machine Learning provides a number of services and libraries to assist you
in training and deploying your models.

Quality assurance and enhancement


Logging. Log user conversations with the bot, including the underlying performance
metrics and any errors. These logs will prove invaluable for debugging issues,
understanding user interactions, and improving the system. Different data stores might
be appropriate for different types of logs. For example, consider Application Insights for
web logs, Azure Cosmos DB for conversations, and Azure Storage for large payloads.
See Write directly to Azure Storage.
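
As an illustration of the conversation-logging option, the following sketch writes one log entry per turn to Azure Cosmos DB with the azure-cosmos Python SDK. The account endpoint, database, container, and document fields are placeholders, not the Botbuilder Utils samples referenced later in this article.

# Minimal sketch of logging a conversation turn to Azure Cosmos DB.
# Endpoint, credential, database, container, and field names are placeholders.
from azure.cosmos import CosmosClient

client = CosmosClient(url="https://<account>.documents.azure.com:443/",
                      credential="<key-or-credential>")
container = client.get_database_client("botlogs").get_container_client("conversations")

container.upsert_item({
    "id": "conv-123-turn-7",           # unique ID per logged turn
    "conversationId": "conv-123",      # partition key in this sketch
    "userUtterance": "Where is my order?",
    "botResponse": "Your order shipped yesterday.",
    "latencyMs": 420,
})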

Feedback. It's also important to understand how satisfied users are with their bot
interactions. If you have a record of user feedback, you can use this data to focus your
efforts on improving certain interactions and retraining the AI models for improved
performance. Use the feedback to retrain the models, such as LUIS, in your system.

Testing. Testing a bot involves unit tests, integration tests, regression tests, and
functional tests. For testing, we recommend recording real HTTP responses from
external services, such as Azure Search or QnA Maker, so they can be played back
during unit testing without needing to make real network calls to external services.

7 Note

To jump-start your development in these areas, look at the Botbuilder Utils for
JavaScript . This repo contains sample utility code for bots built with Microsoft
Bot Framework v4 and running Node.js. It includes the following packages:
Azure Cosmos DB Logging Store . Shows how to store and query bot logs
in Azure Cosmos DB.
Application Insights Logging Store . Shows how to store and query bot logs
in Application Insights.
Feedback Collection Middleware . Sample middleware that provides a bot
user feedback-request mechanism.
Http Test Recorder . Records HTTP traffic from services external to the bot. It
comes pre-built with support for LUIS, Azure Search, and QnAMaker, but
extensions are available to support any service. This helps you automate bot
testing.

These packages are provided as utility sample code, and come with no guarantee
of support or updates.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Availability
As you roll out new features or bug fixes to your bot, it's best to use multiple
deployment environments, such as staging and production. Using deployment slots
from Azure DevOps allows you to do this with zero downtime. You can test your latest
upgrades in the staging environment before swapping them to the production
environment. In terms of handling load, App Service is designed to scale up or out
manually or automatically. Because your bot is hosted in Microsoft's global datacenter
infrastructure, the App Service SLA promises high availability.

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

As with any other application, the bot can be designed to handle sensitive data.
Therefore, restrict who can sign in and use the bot. Also limit which data can be
accessed, based on the user's identity or role. Use Microsoft Entra ID for identity and
access control and Key Vault to manage keys and secrets.
DevOps

Monitoring and reporting


Once your bot is running in production, you will need a DevOps team to keep it that
way. Continually monitor the system to ensure the bot operates at peak performance.
Use the logs sent to Application Insights or Azure Cosmos DB to create monitoring
dashboards, either using Application Insights itself, Power BI, or a custom web app
dashboard. Send alerts to the DevOps team if critical errors occur or performance falls
below an acceptable threshold.

Automated resource deployment


The bot itself is only part of a larger system that provides it with the latest data and
ensures its proper operation. All of these other Azure resources — data orchestration
services such as Data Factory, storage services such as Azure Cosmos DB, and so forth —
must be deployed. Azure Resource Manager provides a consistent management layer
that you can access through the Azure portal, PowerShell, or the Azure CLI. For speed
and consistency, it's best to automate your deployment using one of these approaches.

Continuous bot deployment


You can deploy the bot logic directly from your IDE or from a command line, such as the
Azure CLI. As your bot matures, however, it's best to use a continual deployment
process using a CI/CD solution such as Azure DevOps, as described in the article Set up
continuous deployment. This is a good way to ease the friction in testing new features
and fixes in your bot in a near-production environment. It's also a good idea to have
multiple deployment environments, typically at least staging and production. Azure
DevOps supports this approach.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

Use the Azure pricing calculator to estimate costs. Here are some other
considerations.

Bot application
In this architecture, the main cost driver is the Azure App Service in which the bot
application logic is hosted. Choose an App Service plan tier that best suits your needs.
Here are some recommendations:

Use Free and Shared (preview) tiers for testing purposes because the shared
resources cannot scale out.
Run your production workload on Basic, Standard, and Premium tiers because the
app runs on dedicated virtual machine instances and has allocated resources that
can scale out. App Service plans are billed on a per second basis.

You are charged for the instances in the App Service plan, even when the app is
stopped. Delete plans that you don't intend to use long term, such as test deployments.

For more information, see How much does my App Service plan cost?.

Data ingestion
Azure Data Factory

In this architecture, Data Factory automates the data ingestion pipeline. Explore a
range of data integration capabilities to fit your budget needs, from managed SQL
Server Integration Services for seamless migration of SQL Server projects to the
cloud (cost effective option), to large-scale, serverless data pipelines for integrating
data of all shapes and sizes.

For an example, see Azure Data Factory - example cost analysis.

Azure Functions

In this reference architecture, Azure Functions is billed on the Consumption
plan. You are charged based on per-second resource consumption and each time
an event triggers the execution of the function. Processing several events in a
single execution or in batches can reduce cost.

Azure scales the infrastructure required to run functions as needed. When the
workload is low, the infrastructure is scaled down to zero with no associated
cost. Whenever the workload grows, Azure uses enough capacity to serve all the
demand. Because you pay per actual use, you can manage the exact cost of each
component.

Logic Apps
Logic Apps has a pay-as-you-go pricing model. Triggers, actions, and connector
executions are metered each time a logic app runs. All successful and unsuccessful
actions, including triggers, are counted as executions.

For instance, if your logic app processes 1,000 messages a day from Azure Service Bus, a
workflow of five actions will cost less than $6. For more information, see Logic Apps
pricing.

For other cost considerations, see the Cost section in Microsoft Azure Well-Architected
Framework.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Robert Alexander | Senior Software Engineer


Abhinav Mithal | Principal Software Engineer

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Review the Virtual Assistant template to quickly get started building conversational
bots.

Product documentation:

Language Understanding (LUIS)


Azure App Service
QnA Maker
Azure Data Factory
What is Azure Logic Apps?
Azure Functions

Microsoft Learn Training modules:

Create Intelligent Bots with the Azure Bot Service


Build a bot with QnA Maker and Azure Bot Service
Related resources
Speech-to-text conversion
Build a real-time recommendation
API on Azure
Azure Cosmos DB Azure Databricks Azure Kubernetes Service (AKS) Azure Machine Learning

This reference architecture shows how to train a recommendation model by using Azure
Databricks, and then deploy the model as an API by using Azure Cosmos DB, Azure
Machine Learning, and Azure Kubernetes Service (AKS). For a reference implementation
of this architecture see Building a Real-time Recommendation API on GitHub.

Architecture

Download a Visio file of this architecture.

This reference architecture is for training and deploying a real-time recommender
service API that can provide the top 10 movie recommendations for a user.

Dataflow
1. Track user behaviors. For example, a back-end service might log when a user rates
a movie or clicks a product or news article.
2. Load the data into Azure Databricks from an available data source.
3. Prepare the data and split it into training and testing sets to train the model. (This
guide describes options for splitting data.)
4. Fit the Spark collaborative filtering model to the data (a minimal sketch follows this list).
5. Evaluate the quality of the model using rating and ranking metrics. (This guide
provides details about the metrics that you can use to evaluate your
recommender.)
6. Precompute the top 10 recommendations per user and store as a cache in Azure
Cosmos DB.
7. Deploy an API service to AKS using the Machine Learning APIs to containerize and
deploy the API.
8. When the back-end service gets a request from a user, call the recommendations
API hosted in AKS to get the top 10 recommendations and display them to the
user.
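
For orientation, the following is a minimal sketch of steps 3 through 6 using Spark MLlib's ALS implementation of collaborative filtering. The DataFrame and column names follow the common MovieLens convention and are assumptions; the Microsoft Recommenders repository contains the full implementation.

# Minimal sketch: train, evaluate, and precompute recommendations with ALS.
# ratings_df is an assumed Spark DataFrame with userId, movieId, and rating columns.
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

train, test = ratings_df.randomSplit([0.75, 0.25], seed=42)

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop", nonnegative=True)
model = als.fit(train)

rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                           predictionCol="prediction").evaluate(model.transform(test))

# Precompute the top 10 recommendations per user; in this architecture the
# results are then written to Azure Cosmos DB for low-latency lookups.
top10_per_user = model.recommendForAllUsers(10)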

Components
Azure Databricks . Databricks is a development environment used to prepare
input data and train the recommender model on a Spark cluster. Azure Databricks
also provides an interactive workspace to run and collaborate on notebooks for
any data processing or machine learning tasks.
Azure Kubernetes Service (AKS). AKS is used to deploy and operationalize a
machine learning model service API on a Kubernetes cluster. AKS hosts the
containerized model, providing scalability that meets your throughput
requirements, identity and access management, and logging and health
monitoring.
Azure Cosmos DB . Azure Cosmos DB is a globally distributed database service
used to store the top 10 recommended movies for each user. Azure Cosmos DB is
well-suited for this scenario, because it provides low latency (10 ms at 99th
percentile) to read the top recommended items for a given user.
Machine Learning . This service is used to track and manage machine learning
models, and then package and deploy these models to a scalable AKS
environment.
Microsoft Recommenders . This open-source repository contains utility code and
samples to help users get started in building, evaluating, and operationalizing a
recommender system.

Scenario details
This architecture can be generalized for most recommendation engine scenarios,
including recommendations for products, movies, and news.

Potential use cases


Scenario: A media organization wants to provide movie or video recommendations to
its users. By providing personalized recommendations, the organization meets several
business goals, including increased click-through rates, increased engagement on its
website, and higher user satisfaction.

This solution is optimized for the retail industry and for the media and entertainment
industries.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Batch scoring of Spark models on Azure Databricks describes a reference architecture
that uses Spark and Azure Databricks to execute scheduled batch scoring processes. We
recommend this approach for generating new recommendations.

Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands
placed on it by users in an efficient manner. For more information, see Performance
efficiency pillar overview.

Performance is a primary consideration for real-time recommendations, because
recommendations usually fall in the critical path of a user request on your website.

The combination of AKS and Azure Cosmos DB enables this architecture to provide a
good starting point to provide recommendations for a medium-sized workload with
minimal overhead. Under a load test with 200 concurrent users, this architecture
provides recommendations at a median latency of about 60 ms and performs at a
throughput of 180 requests per second. The load test was run against the default
deployment configuration (a 3x D3 v2 AKS cluster with 12 vCPUs, 42 GB of memory, and
11,000 Request Units (RUs) per second provisioned for Azure Cosmos DB).
Azure Cosmos DB is recommended for its turnkey global distribution and usefulness in
meeting any database requirements your app has. To reduce latency slightly, consider
using Azure Cache for Redis instead of Azure Cosmos DB to serve lookups. Azure Cache
for Redis can improve performance of systems that rely heavily on data in back-end
stores.

Scalability
If you don't plan to use Spark, or you have a smaller workload that doesn't need
distribution, consider using a Data Science Virtual Machine (DSVM) instead of Azure
Databricks. A DSVM is an Azure virtual machine with deep learning frameworks and
tools for machine learning and data science. As with Azure Databricks, any model you
create in a DSVM can be operationalized as a service on AKS via Machine Learning.
During training, either provision a larger fixed-size Spark cluster in Azure Databricks, or
configure autoscaling. When autoscaling is enabled, Databricks monitors the load on
your cluster and scales up and down as needed. Provision or scale out a larger cluster if
you have a large data size and you want to reduce the amount of time it takes for data
preparation or modeling tasks.

Scale the AKS cluster to meet your performance and throughput requirements. Take care
to scale up the number of pods to fully utilize the cluster, and to scale the nodes of the
cluster to meet the demand of your service. You can also set autoscaling on an AKS
cluster. For more information, see Deploy a model to an Azure Kubernetes Service
cluster.

To manage Azure Cosmos DB performance, estimate the number of reads required per
second, and provision the number of RUs per second (throughput) needed. Use best
practices for partitioning and horizontal scaling.
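
A rough sizing calculation can be as simple as the sketch below. It assumes each lookup is a point read of an item around 1 KB, which costs roughly 1 RU in Azure Cosmos DB; adjust the per-read cost for larger items or query-based access.

# Back-of-the-envelope RU estimate; the numbers are illustrative assumptions.
peak_reads_per_second = 200    # expected recommendation lookups at peak
ru_per_read = 1                # ~1 RU for a 1-KB point read
headroom = 1.5                 # buffer for spikes and retries

provisioned_rus = int(peak_reads_per_second * ru_per_read * headroom)
print(f"Provision roughly {provisioned_rus} RU/s")   # 300 RU/s in this example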

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

The main drivers of cost in this scenario are:

The Azure Databricks cluster size required for training.


The AKS cluster size required to meet your performance requirements.
Azure Cosmos DB RUs provisioned to meet your performance requirements.

Manage the Azure Databricks costs by retraining less frequently and turning off the
Spark cluster when not in use. The AKS and Azure Cosmos DB costs are tied to the
throughput and performance required by your site and will scale up and down
depending on the volume of traffic to your site.

Deploy this scenario


To deploy this architecture, follow the Azure Databricks instructions in the setup
document . Briefly, the instructions require you to:

1. Create an Azure Databricks workspace.


2. Create a new cluster with the following configuration in Azure Databricks:

Cluster mode: Standard


Databricks runtime version: 4.3 (includes Apache Spark 2.3.1, Scala 2.11)
Python version: 3
Driver type: Standard_DS3_v2
Worker type: Standard_DS3_v2 (min and max as required)
Auto termination: (as required)
Spark configuration: (as required)
Environment variables: (as required)
3. Create a personal access token within the Azure Databricks workspace. See the
Azure Databricks authentication documentation for details.
4. Clone the Microsoft Recommenders repository into an environment where you
can execute scripts (for example, your local computer).
5. Follow the Quick install setup instructions to install the relevant libraries on
Azure Databricks.
6. Follow the Quick install setup instructions to prepare Azure Databricks for
operationalization .
7. Import the ALS Movie Operationalization notebook into your workspace. After
logging into your Azure Databricks workspace, do the following:
a. Click Home on the left side of the workspace.
b. Right-click on white space in your home directory. Select Import.
c. Select URL, and paste the following into the text field:
https://github.com/Microsoft/Recommenders/blob/master/examples/05_operationalize/als_movie_o16n.ipynb
d. Click Import.
8. Open the notebook within Azure Databricks and attach the configured cluster.
9. Run the notebook to create the Azure resources required to create a
recommendation API that provides the top-10 movie recommendations for a given
user.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Miguel Fierro | Principal Data Scientist Manager


Nikhil Joglekar | Product Manager, Azure algorithms and data science

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Building a Real-time Recommendation API
What is Azure Databricks?
Azure Kubernetes Service
Welcome to Azure Cosmos DB
What is Azure Machine Learning?

Related resources
Batch scoring of Spark models on Azure Databricks
Build a content-based recommendation system
Personalization using Cosmos DB
Retail assistant with visual capabilities
Create personalized marketing solutions in near real time
Personalized offers
Build and deploy a social media
analytics solution
Azure AI services Azure Synapse Analytics Azure Machine Learning Azure Data Lake Power BI Embedded

To best address customer needs, organizations need to extract insights from social
media about their customers. This article presents a solution for analyzing news and
social media data. The solution extends the Azure Social Media Analytics Solution
Accelerator , which gives developers the resources needed to build and deploy a social
media monitoring platform on Azure in a few hours. That platform collects social media
and website data and presents the data in a format that supports the business decision–
making process.

Apache®, Apache Spark®, and the flame logo are either registered trademarks or
trademarks of the Apache Software Foundation in the United States and/or other
countries. No endorsement by The Apache Software Foundation is implied by the use of
these marks.

Architecture

Download a Visio file of this architecture.

Dataflow
1. Azure Synapse Analytics pipelines ingest external data and store that data in Azure
Data Lake. One pipeline ingests data from news APIs. The other pipeline ingests
data from the Twitter API.

2. Apache Spark pools in Azure Synapse Analytics are used to process and enrich the
data.

3. The Spark pools use the following services:

Azure Cognitive Service for Language, for named entity recognition (NER),
key phrase extraction, and sentiment analysis
Azure Cognitive Services Translator, to translate text
Azure Maps, to link data to geographical coordinates

4. The enriched data is stored in Data Lake.

5. A serverless SQL pool in Azure Synapse Analytics makes the enriched data
available to Power BI.

6. Power BI Desktop dashboards provide insights into the data.

7. As an alternative to the previous step, Power BI dashboards that are embedded in
Azure App Service web apps provide web and mobile app users with insights into
the data.

8. As an alternative to steps 5 through 7, the enriched data is used to train a custom
machine learning model in Azure Machine Learning.

9. The model is deployed to a Machine Learning endpoint.

10. A managed online endpoint is used for online, real-time inferencing, for instance,
on a mobile app (A). Alternatively, a batch endpoint is used for offline model
inferencing (B).
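
To show what step 10(A) might look like from a client, the following sketch calls a managed online endpoint with the Azure Machine Learning Python SDK (azure-ai-ml). The subscription, workspace, endpoint name, and request file are placeholders.

# Minimal sketch of invoking a managed online endpoint; identifiers are placeholders.
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

ml_client = MLClient(DefaultAzureCredential(),
                     subscription_id="<subscription-id>",
                     resource_group_name="<resource-group>",
                     workspace_name="<workspace>")

response = ml_client.online_endpoints.invoke(
    endpoint_name="social-media-scoring",      # hypothetical endpoint name
    request_file="sample_request.json",        # JSON payload with records to score
)
print(response)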

Components
Azure Synapse Analytics is an integrated analytics service that accelerates time
to insight across data warehouses and big data systems.

Cognitive Service for Language consists of cloud-based services that provide AI
functionality. You can use the REST APIs and client library SDKs to build cognitive
intelligence into apps even if you don't have AI or data science skills. Features
include:
Named entity recognition (NER) for identifying and categorizing people, places,
organizations, and quantities in unstructured text.
Key phrase extraction for identifying key talking points in a post or an article.
Sentiment analysis for providing insight into the sentiment of posts by
detecting positive, negative, neutral, and mixed-sentiment content.

Translator helps you to translate text instantly or in batches across more than
100 languages. This service uses the latest innovations in machine translation.
Translator supports a wide range of use cases, such as translation for call centers,
multilingual conversational agents, and in-app communication. For the languages
that Translator supports, see Translation.

Azure Maps is a suite of geospatial services that help you incorporate location-
based data into web and mobile solutions. You can use the location and map data
to generate insights, inform data-driven decisions, enhance security, and improve
customer experiences. This solution uses Azure Maps to link news and posts to
geographical coordinates.

Data Lake is a massively scalable data lake for high-performance analytics
workloads.

App Service provides a framework for building, deploying, and scaling web apps.
The Web Apps feature is a service for hosting web applications, REST APIs, and
mobile back ends.

Machine Learning is a cloud-based environment that you can use to train,
deploy, automate, manage, and track machine learning models.

Power BI is a collection of analytics services and apps. You can use Power BI to
connect and display unrelated sources of data.

Alternatives
You can simplify this solution by eliminating Machine Learning and the custom machine
learning models, as the following diagram shows. For more information, see Deploy this
scenario, later in this article.

Download a Visio file of this architecture.

Scenario details
Marketing campaigns are about more than the message that you deliver. When and
how you deliver that message is just as important. Without a data-driven, analytical
approach, campaigns can easily miss opportunities or struggle to gain traction. Those
campaigns are often based on social media analysis, which has become increasingly
important for companies and organizations around the world. Social media analysis is a
powerful tool that you can use to receive instant feedback on products and services,
improve interactions with customers to increase customer satisfaction, keep up with the
competition, and more. Companies often lack efficient, viable ways to monitor social
media conversations. As a result, they miss opportunities to use these insights to inform
their strategies and plans.

This article's solution benefits a wide spectrum of social media and news analysis
applications. By deploying the solution instead of manually deploying its resources, you
can reduce your time to market. You can also:

Extract news and Twitter posts about a specific subject.


Translate the extracted text to your preferred language.
Extract key points and entities from the news and posts.
Identify the sentiment about the subject.

For instance, to see the latest discussions about Satya Nadella, you enter his name in a
query. The solution then accesses news APIs and the Twitter API to provide information
about him from around the web.
Potential use cases
By extracting information about your customers from social media, you can enhance
customer experiences, increase customer satisfaction, gain new leads, and prevent
customer churn. These applications of social media analytics fall into three main areas:

Measuring brand health:


Capturing customer reactions and feedback for new products or services on
social media
Analyzing sentiment on social media interactions for a new product or service
Capturing the sentiment about a brand and determining whether the overall
perception is positive or negative

Building and maintaining customer relationships:


Quickly identifying customer concerns
Listening to untagged brand mentions

Optimizing marketing investments:


Extracting insights from social media for campaign analysis
Doing targeted marketing optimization
Reaching a wider audience by finding new leads and influencers

Marketing is an integral part of every organization. As a result, you can use this social
media analytics solution for these use cases in various industries:

Retail
Finance
Manufacturing
Healthcare
Government
Energy
Telecommunications
Automotive
Nonprofit
Gaming
Media and entertainment
Travel, including hospitality and restaurants
Facilities, including real estate
Sports

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Reliability
Reliability ensures your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.

Use Azure Monitor and Application Insights to monitor the health of Azure
resources.
Review the following resiliency considerations before you implement this solution:
Azure Synapse Analytics
App Service
For more information about resiliency in Azure, see Design reliable Azure
applications.
For availability guarantees of various Azure components, see the following service
level agreements (SLAs):
SLA for Azure Synapse Analytics
SLA for Storage Accounts
SLA for Azure Maps
SLA for Azure Cognitive Services
SLA for Azure Machine Learning
SLA for App Service

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

To estimate the cost of this solution, use the Azure pricing calculator .

Operational excellence
Operational excellence covers the operations processes that deploy an application and
keep it running in production. For more information, see Overview of the operational
excellence pillar.

Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands
placed on it by users in an efficient manner. For more information, see Performance
efficiency pillar overview.

For information about Spark pool scaling and node sizes, see Apache Spark pool
configurations in Azure Synapse Analytics.
You can scale Machine Learning training pipelines up and down based on data size
and other configuration parameters.
Serverless SQL pools are available on demand. They don't require scaling up,
down, in, or out.
Azure Synapse Analytics supports Apache Spark 3.1.2, which delivers significant
performance improvements over its predecessors .

Deploy this scenario


To deploy this solution and run a sample social media analytics scenario, see the
deployment guide in Getting Started . That guide helps you set up the Social Media
Analytics Solution Accelerator resources, which the architecture diagram in
Alternatives shows. The deployment doesn't include the following components: Machine
Learning, the managed endpoints, and the App Service web app.

Prerequisites
To use the solution accelerator, you need access to an Azure subscription .
A basic understanding of Azure Synapse Analytics, Azure Cognitive Services, Azure
Maps, and Power BI is helpful but not required.
A news API account is required.
A Twitter developer account with Elevated access to Twitter API features is
required.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributor.
Principal author:

Christina Skarpathiotaki | AI Specialized Cloud Solution Architect

Next steps
What is Azure Synapse Analytics?
Azure Machine Learning documentation
What are Azure Cognitive Services?
Azure Cognitive Service for Language documentation
What is Azure Cognitive Services Translator?
What is Azure Maps?
What is Power BI?
Tutorial: Sentiment analysis with Cognitive Services in Azure Synapse Analytics
Tutorial: Text Analytics with Cognitive Service in Azure Synapse Analytics

Related resources
Artificial intelligence (AI) architecture design
Choose a Microsoft cognitive services technology
Optimize marketing with machine learning
Spaceborne data analysis with Azure Synapse Analytics
Implement logging and
monitoring for Azure OpenAI
models
Azure AI services Azure API Management Azure Monitor Azure Active Directory

This solution provides comprehensive logging and monitoring and enhanced security
for enterprise deployments of the Azure OpenAI Service API. The solution enables
advanced logging capabilities for tracking API usage and performance and robust
security measures to help protect sensitive data and help prevent malicious activity.

Architecture

Download a Visio file of this architecture.

Workflow
1. Client applications access Azure OpenAI endpoints to perform text generation
(completions) and model training (fine-tuning).

2. Azure Application Gateway provides a single point of entry to Azure OpenAI
models and provides load balancing for APIs.

7 Note
Load balancing of stateful operations like model fine-tuning, deployments,
and inference of fine-tuned models isn't supported.

3. Azure API Management enables security controls and auditing and monitoring of
the Azure OpenAI models.
a. In API Management, enhanced-security access is granted via Microsoft Entra
groups with subscription-based access permissions.
b. Auditing is enabled for all interactions with the models via Azure Monitor
request logging.
c. Monitoring provides detailed Azure OpenAI model usage KPIs and metrics,
including prompt information and token statistics for usage traceability.

4. API Management connects to all Azure resources via Azure Private Link. This
configuration provides enhanced security for all traffic via private endpoints and
contains traffic in the private network.

5. Multiple Azure OpenAI instances enable scale-out of API usage to ensure high
availability and disaster recovery for the service.

Components
Application Gateway . Application load balancer to help ensure that all users of
the Azure OpenAI APIs get the fastest response and highest throughput for model
completions.
API Management . API management platform for accessing back-end Azure
OpenAI endpoints. Provides monitoring and logging that's not available natively in
Azure OpenAI.
Azure Virtual Network . Private network infrastructure in the cloud. Provides
network isolation so that all network traffic for models is routed privately to Azure
OpenAI.
Azure OpenAI . Service that hosts models and provides generative model
completion outputs.
Monitor . End-to-end observability for applications. Provides access to
application logs via Kusto Query Language. Also enables dashboard reports and
monitoring and alerting capabilities.
Azure Key Vault . Enhanced-security storage for keys and secrets that are used by
applications.
Azure Storage . Application storage in the cloud. Provides Azure OpenAI with
accessibility to model training artifacts.
Microsoft Entra ID . Enhanced-security identity manager. Enables user
authentication and authorization to the application and to platform services that
support the application. Also provides Group Policy to ensure that the principle of
least privilege is applied to all users.

Alternatives
Azure OpenAI provides native logging and monitoring. You can use this native
functionality to track telemetry of the service, but the default cognitive service logging
doesn't track or record inputs and outputs of the service, like prompts, tokens, and
models. These metrics are especially important for compliance and to ensure that the
service operates as expected. Also, by tracking interactions with the large language
models deployed to Azure OpenAI, you can analyze how your organization is using the
service to identify cost and usage patterns that can help inform decisions on scaling and
resource allocation.

The following table provides a comparison of the metrics provided by the default Azure
OpenAI logging and those provided by this solution.

| Metric | Default Azure OpenAI logging | This solution |
| --- | --- | --- |
| Request count | x | x |
| Data in (size) / data out (size) | x | x |
| Latency | x | x |
| Token transactions (total) | x | x |
| Caller IP address | x (last octet masked) | x |
| Model utilization | | x |
| Token utilization (input/output) | x | x |
| Input prompt detail | | x (limited to 8,192 response characters) |
| Output completion detail | | x (limited to 8,192 response characters) |
| Deployment operations | x | x |
| Embedding operations | x | x (limited to 8,192 response characters) |

Scenario details
Large enterprises that use generative AI models need to implement auditing and
logging of the use of these models to ensure responsible use and corporate compliance.
This solution provides enterprise-level logging and monitoring for all interactions with
AI models to mitigate harmful use of the models and help ensure that security and
compliance standards are met. The solution integrates with existing APIs for Azure
OpenAI with little modification to take advantage of existing code bases. Administrators
can also monitor service usage for reporting.

The solution provides these advantages:

Comprehensive logging of Azure OpenAI model execution, tracked to the source
IP address. Log information includes text that users submit to the model and text
received back from the model. This logging helps ensure that models are used
responsibly and within the approved use cases of the service.
High availability of the model APIs to ensure that user requests are met even if the
traffic exceeds the limits of a single Azure OpenAI service.
Role-based access managed via Microsoft Entra ID to ensure that the principle of
least privilege is applied.

Example query for usage monitoring

ApiManagementGatewayLogs
| where OperationId == 'completions_create'
| extend modelkey = substring(parse_json(BackendResponseBody)['model'], 0, indexof(parse_json(BackendResponseBody)['model'], '-', 0, -1, 2))
| extend model = tostring(parse_json(BackendResponseBody)['model'])
| extend prompttokens = parse_json(parse_json(BackendResponseBody)['usage'])['prompt_tokens']
| extend completiontokens = parse_json(parse_json(BackendResponseBody)['usage'])['completion_tokens']
| extend totaltokens = parse_json(parse_json(BackendResponseBody)['usage'])['total_tokens']
| extend ip = CallerIpAddress
| summarize
    sum(todecimal(prompttokens)),
    sum(todecimal(completiontokens)),
    sum(todecimal(totaltokens)),
    avg(todecimal(totaltokens))
    by ip, model

Output:

Example query for prompt usage monitoring

ApiManagementGatewayLogs
| where OperationId == 'completions_create'
| extend model = tostring(parse_json(BackendResponseBody)['model'])
| extend prompttokens = parse_json(parse_json(BackendResponseBody)['usage'])
['prompt_tokens']
| extend prompttext = substring(parse_json(parse_json(BackendResponseBody)
['choices'])[0], 0, 100)

Output:

Potential use cases


Deployment of Azure OpenAI for internal enterprise users to accelerate
productivity
High availability of Azure OpenAI for internal applications
Enhanced-security use of Azure OpenAI within regulated industries

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that you can use to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Reliability
Reliability ensures that your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.

This scenario ensures high availability of the large language models for your enterprise
users. The Azure application gateway provides an effective layer-7 application delivery
mechanism to ensure fast and consistent access to applications. You can use API
Management to configure, manage, and monitor access to your models. The inherent
high availability of platform services like Storage, Key Vault, and Virtual Network ensure
high reliability for your application. Finally, multiple instances of Azure OpenAI ensure
service resilience in case of application-level failures. These architecture components can
help you ensure the reliability of your application at enterprise scale.

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

By implementing best practices for application-level and network-level isolation of your
cloud services, this scenario mitigates risks of data exfiltration and data leakage. All
network traffic containing potentially sensitive data that's input to the model is isolated
in a private network. This traffic doesn't traverse public internet routes. You can use
Azure ExpressRoute to further isolate network traffic to the corporate intranet and help
ensure end-to-end network security.

Cost optimization
Cost optimization is about reducing unnecessary expenses and improving operational
efficiencies. For more information, see Overview of the cost optimization pillar.

To help you explore the cost of running this scenario, we've preconfigured all the
services in the Azure pricing calculator. To learn how the pricing would change for your
use case, change the appropriate variables to match your expected traffic.

The following three sample cost profiles provide estimates based on the amount of
traffic. (The estimates assume that a document contains approximately 1,000 tokens.)

Small: For processing 10,000 documents per month.
Medium: For processing 100,000 documents per month.
Large: For processing 10 million documents per month.
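The per-document token assumption makes it easy to translate these profiles into rough token volumes. The following is a minimal sketch, not part of the solution, that assumes approximately 1,000 tokens per document and a hypothetical per-1,000-token price; substitute the current Azure OpenAI pricing for your model and region.

Python

# Rough token-volume and cost sketch for the three sample profiles.
# Assumptions: ~1,000 tokens per document and a hypothetical price per
# 1,000 tokens. Replace the price with current Azure OpenAI pricing.
TOKENS_PER_DOCUMENT = 1_000
PRICE_PER_1K_TOKENS = 0.002  # Hypothetical value, not an official price.

profiles = {"Small": 10_000, "Medium": 100_000, "Large": 10_000_000}

for name, docs_per_month in profiles.items():
    tokens = docs_per_month * TOKENS_PER_DOCUMENT
    cost = tokens / 1_000 * PRICE_PER_1K_TOKENS
    print(f"{name}: {tokens:,} tokens/month, ~${cost:,.2f}/month at the assumed price")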

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Ashish Chauhan | Cloud Solution Architect – Data / AI
Jake Wang | Cloud Solution Architect – AI / Machine Learning

Other contributors:

Mick Alberts | Technical Writer

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Azure OpenAI request form
Best practices for prompt engineering with OpenAI API
Azure OpenAI: Documentation, quickstarts, API reference
Azure-Samples/openai-python-enterprise-logging (GitHub)
Configure Azure Cognitive Services virtual networks

Related resources
Protect APIs with Azure Application Gateway and Azure API Management
Query-based document summarization
AI architecture design
Secure research environment for regulated data
Azure Data Science Virtual Machines • Azure Machine Learning • Azure Data Factory

This architecture shows a secure research environment intended to allow researchers to
access sensitive data under a higher level of control and data protection. This article is
applicable for organizations that are bound by regulatory compliance or other strict
security requirements.

Architecture

(Architecture diagram: a data owner uploads data to a public Blob Storage account; Data Factory, using a managed identity, copies it to a private Blob Storage account inside a secure virtual network that contains data science virtual machines, Azure Machine Learning compute, Key Vault, Azure Firewall with a firewall policy, a user-defined routing table, and virtual network gateways. A researcher connects through Azure Virtual Desktop in a peered virtual network protected by a network security group. A Logic App and an approver gate the export of approved data back through Data Factory. Azure Active Directory, Microsoft Sentinel, Microsoft Defender for Cloud, Azure Monitor, and Azure Policy span both resource groups.)
Download a Visio file of this architecture.

Dataflow
1. Data owners upload datasets into a public blob storage account. The data is
encrypted by using Microsoft-managed keys.

2. Azure Data Factory uses a trigger that starts copying the uploaded dataset to a
specific location (import path) on another storage account with security controls.
The storage account can only be reached through a private endpoint. Also, it's
accessed by a service principal with limited permissions. Data Factory deletes the
original copy, making the dataset immutable.

3. Researchers access the secure environment through a streaming application using
Azure Virtual Desktop as a privileged jump box.

4. The dataset in the secure storage account is presented to the data science VMs
provisioned in a secure network environment for research work. Much of the data
preparation is done on those VMs.

5. The secure environment has Azure Machine Learning compute that users can access
through a private endpoint for Azure Machine Learning capabilities, such as training,
deploying, automating, and managing machine learning models. At this point, models are
created that meet regulatory guidelines. All model data is de-identified by
removing personal information.

6. Models or de-identified data is saved to a separate location on the secure storage
(export path). When new data is added to the export path, a Logic App is
triggered. In this architecture, the Logic App is outside the secure environment
because no data is sent to the Logic App. Its only function is to send a notification
and start the manual approval process.

The app starts an approval process requesting a review of data that is queued to
be exported. The manual reviewers ensure that sensitive data isn't exported. After
the review process, the data is either approved or denied.

Note

If an approval step is not required on exfiltration, the Logic App step could be
omitted.

7. If the de-identified data is approved, it's sent to the Data Factory instance.

8. Data Factory moves the data to the public storage account in a separate container
to allow external researchers to have access to their exported data and models.
Alternately, you can provision another storage account in a lower security
environment.

Components
This architecture consists of several Azure cloud services that scale resources according
to need. The services and their roles are described below. For links to product
documentation to get started with these services, see Next steps.

Core workload components


Here are the core components that move and process research data.

Azure Data Science Virtual Machine (DSVM): VMs that are configured with
tools used for data analytics and machine learning.

Azure Machine Learning: Used to train, deploy, automate, and manage machine
learning models and to manage the allocation and use of ML compute resources.

Azure Machine Learning Compute: A cluster of nodes that are used to train and
test machine learning and AI models. The compute is allocated on demand based
on an automatic scaling option.

Azure Blob storage: There are two instances. The public instance is used to
temporarily store the data uploaded by data owners. Also, it stores deidentified
data after modeling in a separate container. The second instance is private. It
receives the training and test data sets from Machine Learning that are used by the
training scripts. Storage is mounted as a virtual drive onto each node of a Machine
Learning Compute cluster.

Azure Data Factory: Automatically moves data between storage accounts of
differing security levels to ensure separation of duties.

Azure Virtual Desktop is used as a jump box to gain access to the resources in
the secure environment with streaming applications and a full desktop, as needed.
Alternately, you can use Azure Bastion. But have a clear understanding of the
security control differences between the two options. Virtual Desktop has some
advantages:
Ability to stream an app like VSCode to run notebooks against the machine
learning compute resources.
Ability to limit copy, paste, and screen captures.
Support for Microsoft Entra authentication to DSVM.

Azure Logic Apps provides automated low-code workflow to develop both the
trigger and release portions of the manual approval process.

Posture management components


These components continuously monitor the posture of the workload and its
environment. The purpose is to discover and mitigate risks as soon as they emerge.

Microsoft Defender for Cloud is used to evaluate the overall security posture of
the implementation and provide an attestation mechanism for regulatory
compliance. Issues that were previously found during audits or assessments can be
discovered early. Use features to track progress such as secure score and
compliance score.

Microsoft Sentinel is a security information and event management (SIEM) and
security orchestration, automation, and response (SOAR) solution. You can centrally view
logs and alerts from various sources and take advantage of advanced AI and
security analytics to detect, hunt, prevent, and respond to threats.

Azure Monitor provides observability across your entire environment. View
metrics, activity logs, and diagnostics logs from most of your Azure resources
without added configuration. Management tools, such as those in Microsoft
Defender for Cloud, also push log data to Azure Monitor.

Governance components
Azure Policy helps to enforce organizational standards and to assess compliance
at-scale.

Alternatives
This solution uses Data Factory to move the data to the public storage account in a
separate container, in order to allow external researchers to have access to their
exported data and models. Alternately, you can provision another storage account
in a lower security environment.
This solution uses Azure Virtual Desktop as a jump box to gain access to the
resources in the secure environment, with streaming applications and a full
desktop. Alternately, you can use Azure Bastion. But Virtual Desktop has some
advantages, which include the ability to stream an app, to limit copy/paste and
screen captures, and to support Microsoft Entra authentication. You can also consider
configuring a point-to-site VPN for training locally offline. This option also helps save
the cost of running multiple VMs as workstations.
To secure data at rest, this solution encrypts all Azure Storage with Microsoft-
managed keys using strong cryptography. Alternately, you can use customer-
managed keys. The keys must be stored in a managed key store.
Scenario details

Potential use cases


This architecture was originally created for higher education research institutions with
HIPAA requirements. However, this design can be used in any industry that requires
isolation of data for research purposes. Some examples include:

Industries that process regulated data as per NIST requirements
Medical centers collaborating with internal or external researchers
Banking and finance

By following this guidance, you can maintain full control of your research data, maintain
separation of duties, and meet strict regulatory compliance standards, while enabling
collaboration between the typical roles involved in a research-oriented workload: data
owners, researchers, and approvers.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

The main objective of this architecture is to provide a secure and trusted research
environment that strictly limits the exfiltration of data from the secure area.

Network security
Azure resources that are used to store, test, and train research data sets are provisioned
in a secure environment. That environment is an Azure Virtual Network (VNet) that has
network security group (NSG) rules to restrict access, mainly:

Inbound and outbound access to the public internet and within the VNet.

Access to and from specific services and ports. For example, this architecture
blocks all port ranges except the ones required for Azure services (such as Azure
Monitor). A full list of service tags and the corresponding services can be found
here.

Also, access from the VNet with Azure Virtual Desktop (AVD), on ports limited to
approved access methods, is accepted; all other traffic is denied. When compared
to this environment, the other VNet (with AVD) is relatively open.

The main blob storage in the secure environment is off the public internet. It's only
accessible within the VNet through private endpoint connections and Azure Storage
firewalls, which limit the networks from which clients can connect to Azure file shares.

This architecture uses credential-based authentication for the main data store that is in
the secure environment. In this case, the connection information like the subscription ID
and token authorization is stored in a key vault. Another option is to create identity-
based data access, where your Azure account is used to confirm if you have access to
the Storage service. In an identity-based data access scenario, no authentication
credentials are saved. For the details on how to use identity-based data access, see
Connect to storage by using identity-based data access.
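As an illustration only, the following sketch shows what identity-based access to the secure storage account could look like with the Azure SDK for Python. The account URL and container name are placeholders, and access is still constrained by the private endpoint, the storage firewall, and the RBAC assignments described in this article.

Python

# Minimal sketch of identity-based data access (no stored credentials).
# The account URL and container name are placeholders; access still flows
# through the private endpoint and is authorized by Azure RBAC role assignments.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()  # Uses managed identity, Azure CLI sign-in, etc.
blob_service = BlobServiceClient(
    account_url="https://<secure-storage-account>.blob.core.windows.net",
    credential=credential,
)

# List blobs in the import path to confirm that the identity has access.
container = blob_service.get_container_client("datasets")
for blob in container.list_blobs(name_starts_with="import/"):
    print(blob.name)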

The compute cluster communicates solely within the virtual network, by using the
Azure Private Link ecosystem and service/private endpoints, rather than public IP
addresses. Make sure you enable No public IP. For details about this feature,
which is currently in preview (as of 3/7/2022), see No public IP for compute instances.
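The sketch below shows one way a compute cluster without public IP addresses might be requested with the Azure Machine Learning Python SDK v2. The workspace identifiers and cluster settings are placeholders, and the virtual network wiring is omitted; confirm the exact parameters against the SDK version and workspace configuration you use.

Python

# Sketch only: request an Azure Machine Learning compute cluster that has no
# public IP. Workspace identifiers are placeholders; virtual network settings
# are omitted and depend on your workspace configuration.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

cluster = AmlCompute(
    name="secure-cpu-cluster",
    size="STANDARD_DS3_V2",
    min_instances=0,
    max_instances=4,
    enable_node_public_ip=False,  # Keep cluster nodes off the public internet.
)
ml_client.compute.begin_create_or_update(cluster).result()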

The secure environment uses Azure Machine Learning compute to access the dataset
through a private endpoint. Additionally, Azure Firewall can be used to control
outbound access from Azure Machine Learning compute. To learn about how to
configure Azure Firewall to control access to Azure Machine Learning compute, which
resides in a machine learning workspace, see Configure inbound and outbound network
traffic.

To learn one of the ways to secure an Azure Machine Learning environment, see the
blog post, Secure Azure Machine Learning Service (AMLS) Environment .

For Azure services that cannot be configured effectively with private endpoints, or to
provide stateful packet inspection, consider using Azure Firewall or a third-party
network virtual appliance (NVA).

Identity management

Access to Blob storage is through Azure role-based access control (RBAC).
Azure Virtual Desktop supports Microsoft Entra authentication to DSVM.

Data Factory uses a managed identity to access data from the blob storage. DSVMs also
use managed identities for remediation tasks.

Data security
To secure data at rest, all Azure Storage is encrypted with Microsoft-managed keys using
strong cryptography.

Alternately, you can use customer-managed keys. The keys must be stored in a
managed key store. In this architecture, Azure Key Vault is deployed in the secure
environment to store secrets such as encryption keys and certificates. Key Vault is
accessed through a private endpoint by the resources in the secure VNet.

Governance considerations
Enable Azure Policy to enforce standards and provide automated remediation to bring
resources into compliance for specific policies. The policies can be applied to a project
subscription or at a management group level, as a single policy or as part of a regulatory
initiative.

For example, in this architecture Azure Policy Guest Configuration was applied to all VMs
in scope. The policy can audit operating systems and machine configuration for the Data
Science VMs.

VM image
The Data Science VMs run customized base images. To build the base image, we highly
recommend technologies like Azure Image Builder. This way you can create a repeatable
image that can be deployed when needed.

The base image might need updates, such as additional binaries. Those binaries should
be uploaded to the public blob storage and flow through the secure environment, much
like the datasets are uploaded by data owners.

Other considerations
Most research solutions are temporary workloads and don't need to be available for
extended periods. This architecture is designed as a single-region deployment with
availability zones. If the business requirements demand higher availability, replicate this
architecture in multiple regions. You would need other components, such as a global load
balancer, to distribute traffic to all those regions. As part of your recovery
strategy, capturing and creating a copy of the customized base image with Azure Image
Builder is highly recommended.

The size and type of the Data Science VMs should be appropriate to the style of work
being performed. This architecture is intended to support a single research project and
the scalability is achieved by adjusting the size and type of the VMs and the choices
made for compute resources available to AML.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

The cost of DSVMs depends on the choice of the underlying VM series. Because the
workload is temporary, the consumption plan is recommended for the Logic App
resource. Use the Azure pricing calculator to estimate costs based on estimated sizing
of resources needed.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Clayton Barlow | Senior Azure Specialist

Next steps
Microsoft Data Science Virtual Machine (DSVM)
What is Azure Machine Learning?
Azure Machine Learning Compute
Introduction to Azure Blob storage
Introduction to Azure Data Factory
Azure Virtual Desktop
Microsoft Defender for Cloud
Microsoft Sentinel
Azure Monitor
Azure Policy
Azure Policy Guest Configuration

Related resources
Compare the machine learning products and technologies from Microsoft
Azure Machine Learning architecture
Scale AI and machine learning initiatives in regulated industries
Many models machine learning (ML) at scale with Azure Machine Learning
Compare Microsoft machine learning products and technologies
Article • 08/31/2023

Learn about the machine learning products and technologies from Microsoft. Compare
options to help you choose how to most effectively build, deploy, and manage your
machine learning solutions.

Cloud-based machine learning products


The following options are available for machine learning in the Azure cloud.

Cloud options | What it is | What you can do with it
Azure Machine Learning | Managed platform for machine learning | Use a pretrained model. Or, train, deploy, and manage models on Azure using Python and CLI
Azure Cognitive Services | Pre-built AI capabilities implemented through REST APIs and SDKs | Build intelligent applications quickly using standard programming languages. Doesn't require machine learning and data science expertise
Azure SQL Managed Instance Machine Learning Services | In-database machine learning for SQL | Train and deploy models inside Azure SQL Managed Instance
Machine learning in Azure Synapse Analytics | Analytics service with machine learning | Train and deploy models inside Azure Synapse Analytics
Machine learning and AI with ONNX in Azure SQL Edge | Machine learning in SQL on IoT | Train and deploy models inside Azure SQL Edge
Azure Databricks | Apache Spark-based analytics platform | Build and deploy models and data workflows using integrations with open-source machine learning libraries and the MLflow platform.

On-premises machine learning products


The following options are available for machine learning on-premises. On-premises
servers can also run in a virtual machine in the cloud.
On-premises options | What it is | What you can do with it
SQL Server Machine Learning Services | In-database machine learning for SQL | Train and deploy models inside SQL Server
Machine Learning Services on SQL Server Big Data Clusters | Machine learning in Big Data Clusters | Train and deploy models on SQL Server Big Data Clusters

Development platforms and tools


The following development platforms and tools are available for machine learning.

Platforms/tools | What it is | What you can do with it
Azure Data Science Virtual Machine | Virtual machine with pre-installed data science tools | Develop machine learning solutions in a pre-configured environment
ML.NET | Open-source, cross-platform machine learning SDK | Develop machine learning solutions for .NET applications
Windows ML | Windows 10 machine learning platform | Evaluate trained models on a Windows 10 device
MMLSpark | Open-source, distributed machine learning and microservices framework for Apache Spark | Create and deploy scalable machine learning applications for Scala and Python.
Machine Learning extension for Azure Data Studio | Open-source and cross-platform machine learning extension for Azure Data Studio | Manage packages, import machine learning models, make predictions, and create notebooks to run experiments for your SQL databases

Azure Machine Learning


Azure Machine Learning is a fully managed cloud service used to train, deploy, and
manage machine learning models at scale. It fully supports open-source technologies,
so you can use tens of thousands of open-source Python packages such as TensorFlow,
PyTorch, and scikit-learn. Rich tools are also available, such as Compute instances,
Jupyter notebooks, or the Azure Machine Learning for Visual Studio Code extension, a
free extension that allows you to manage your resources, model training workflows and
deployments in Visual Studio Code. Azure Machine Learning includes features that
automate model generation and tuning with ease, efficiency, and accuracy.
Use Python SDK, Jupyter notebooks, R, and the CLI for machine learning at cloud scale.
For a low-code or no-code option, use Azure Machine Learning's interactive designer in
the studio to easily and quickly build, test, and deploy models using pre-built machine
learning algorithms.
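For orientation only, here's a minimal sketch of connecting to a workspace and submitting a command job with the Python SDK v2 (azure-ai-ml). The workspace identifiers, source folder, environment, and compute names are placeholders, not part of the article's scenario.

Python

# Minimal sketch: connect to an Azure Machine Learning workspace and submit a
# command job with the Python SDK v2. All identifiers below are placeholders.
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

job = command(
    code="./src",                  # Folder that contains train.py
    command="python train.py",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="cpu-cluster",
)
returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)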

Try Azure Machine Learning for free.

Item | Description
Type | Cloud-based machine learning solution
Supported languages | Python, R
Machine learning phases | Model training, Deployment, MLOps/Management
Key benefits | Code first (SDK) and studio & drag-and-drop designer web interface authoring options. Central management of scripts and run history, making it easy to compare model versions. Easy deployment and management of models to the cloud or edge devices.
Considerations | Requires some familiarity with the model management model.

Azure Cognitive Services


Azure Cognitive Services is a set of pre-built APIs that enable you to build apps that use
natural methods of communication. The term pre-built suggests that you do not need
to bring datasets or data science expertise to train models to use in your applications.
That's all done for you and packaged as APIs and SDKs that allow your apps to see, hear,
speak, understand, and interpret user needs with just a few lines of code. You can easily
add intelligent features to your apps, such as:

Vision: Object detection, face recognition, OCR, etc. See Computer Vision, Face,
Form Recognizer.
Speech: Speech-to-text, text-to-speech, speaker recognition, etc. See Speech
Service.
Language: Translation, Sentiment analysis, key phrase extraction, language
understanding, etc. See Translator, Text Analytics, Language Understanding, QnA
Maker
Decision: Anomaly detection, content moderation, reinforcement learning. See
Anomaly Detector, Content Moderator, Personalizer.

Use Cognitive Services to develop apps across devices and platforms. The APIs keep
improving, and are easy to set up.
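As a small illustration of the "few lines of code" point, the following sketch calls the Text Analytics (Language) client from the Azure SDK for Python to score sentiment. The endpoint, key, and sample document are placeholders for your own resource and data.

Python

# Sketch: sentiment analysis with the Text Analytics client from the Azure SDK
# for Python. The endpoint and key are placeholders for your Language resource.
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

client = TextAnalyticsClient(
    endpoint="https://<your-language-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

documents = ["The onboarding experience was quick and the support team was helpful."]
result = client.analyze_sentiment(documents)

for doc in result:
    if not doc.is_error:
        print(doc.sentiment, doc.confidence_scores)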

Item | Description
Type | APIs for building intelligent applications
Supported languages | Various options depending on the service. Standard ones are C#, Java, JavaScript, and Python.
Machine learning phases | Deployment
Key benefits | Build intelligent applications using pre-trained models available through REST API and SDK. Variety of models for natural communication methods with vision, speech, language, and decision. No machine learning or data science expertise required.

SQL machine learning


SQL machine learning adds statistical analysis, data visualization, and predictive analytics
in Python and R for relational data, both on-premises and in the cloud. Current
platforms and tools include:

SQL Server Machine Learning Services


Machine Learning Services on SQL Server Big Data Clusters
Azure SQL Managed Instance Machine Learning Services
Machine learning in Azure Synapse Analytics
Machine learning and AI with ONNX in Azure SQL Edge
Machine Learning extension for Azure Data Studio

Use SQL machine learning when you need built-in AI and predictive analytics on
relational data in SQL.

Item | Description
Type | On-premises predictive analytics for relational data
Supported languages | Python, R, SQL
Machine learning phases | Data preparation, Model training, Deployment
Key benefits | Encapsulate predictive logic in a database function, making it easy to include in data-tier logic.
Considerations | Assumes a SQL database as the data tier for your application.

Azure Data Science Virtual Machine


The Azure Data Science Virtual Machine is a customized virtual machine environment on
the Microsoft Azure cloud. It is available in versions for both Windows and Linux
Ubuntu. The environment is built specifically for doing data science and developing ML
solutions. It has many popular data science, ML frameworks, and other tools pre-
installed and pre-configured to jump-start building intelligent applications for advanced
analytics.

Use the Data Science VM when you need to run or host your jobs on a single node. Or if
you need to remotely scale up your processing on a single machine.

Item | Description
Type | Customized virtual machine environment for data science
Key benefits | Reduced time to install, manage, and troubleshoot data science tools and frameworks. The latest versions of all commonly used tools and frameworks are included. Virtual machine options include highly scalable images with GPU capabilities for intensive data modeling.
Considerations | The virtual machine cannot be accessed when offline. Running a virtual machine incurs Azure charges, so you must be careful to have it running only when required.

Azure Databricks
Azure Databricks is an Apache Spark-based analytics platform optimized for the
Microsoft Azure cloud services platform. Databricks is integrated with Azure to provide
one-click setup, streamlined workflows, and an interactive workspace that enables
collaboration between data scientists, data engineers, and business analysts. Use
Python, R, Scala, and SQL code in web-based notebooks to query, visualize, and model
data.

Use Databricks when you want to collaborate on building machine learning solutions on
Apache Spark.
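As a brief illustration of the MLflow integration mentioned above, the following sketch trains a scikit-learn model and tracks it with MLflow, as you might do in an Azure Databricks notebook or any MLflow-enabled Python environment. The dataset and metric choices are arbitrary.

Python

# Sketch: train a scikit-learn model and track it with MLflow, as you might in
# an Azure Databricks notebook. Dataset and metric choices are arbitrary.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = Ridge(alpha=1.0).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    mlflow.log_param("alpha", 1.0)
    mlflow.log_metric("mse", mse)
    mlflow.sklearn.log_model(model, "model")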

Item | Description
Type | Apache Spark-based analytics platform
Supported languages | Python, R, Scala, SQL
Machine learning phases | Data preparation, Data preprocessing, Model training, Model tuning, Model inference, Management, Deployment

ML.NET
ML.NET is an open-source, cross-platform machine learning framework. With
ML.NET, you can build custom machine learning solutions and integrate them into your
.NET applications. ML.NET offers varying levels of interoperability with popular
frameworks like TensorFlow and ONNX for training and scoring machine learning and
deep learning models. For resource-intensive tasks like training image classification
models, you can take advantage of Azure to train your models in the cloud.

Use ML.NET when you want to integrate machine learning solutions into your .NET
applications. Choose between the API for a code-first experience and Model Builder or
the CLI for a low-code experience.

Item | Description
Type | Open-source cross-platform framework for developing custom machine learning applications with .NET
Languages supported | C#, F#
Machine learning phases | Data preparation, Training, Deployment
Key benefits | Data science & ML experience not required. Use familiar tools (Visual Studio, VS Code) and languages. Deploy where .NET runs. Extensible. Scalable. Local-first experience.

Windows ML
Windows ML inference engine allows you to use trained machine learning models in
your applications, evaluating trained models locally on Windows 10 devices.

Use Windows ML when you want to use trained machine learning models within your
Windows applications.

Item | Description
Type | Inference engine for trained models in Windows devices
Languages supported | C#/C++, JavaScript

MMLSpark
Microsoft ML for Apache Spark (MMLSpark) is an open-source library that expands
the distributed computing framework Apache Spark . MMLSpark adds many deep
learning and data science tools to the Spark ecosystem, including seamless integration
of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK),
LightGBM , LIME (Model Interpretability) , and OpenCV . You can use these tools to
create powerful predictive models on any Spark cluster, such as Azure Databricks or
Cosmic Spark.

MMLSpark also brings new networking capabilities to the Spark ecosystem. With the
HTTP on Spark project, users can embed any web service into their SparkML models.
Additionally, MMLSpark provides easy-to-use tools for orchestrating Azure Cognitive
Services at scale. For production-grade deployment, the Spark Serving project enables
high throughput, submillisecond latency web services, backed by your Spark cluster.

Item | Description
Type | Open-source, distributed machine learning and microservices framework for Apache Spark
Languages supported | Scala 2.11, Java, Python 3.5+, R (beta)
Machine learning phases | Data preparation, Model training, Deployment
Key benefits | Scalability. Streaming + Serving compatible. Fault-tolerance.
Considerations | Requires Apache Spark

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Zoiner Tejada | CEO and Architect

Next steps
To learn about all the Artificial Intelligence (AI) development products available
from Microsoft, see Microsoft AI platform .
For training in developing AI and Machine Learning solutions with Microsoft, see
Microsoft Learn training.

Related resources
Choose a Microsoft cognitive services technology
Artificial intelligence (AI) architecture design
How Azure Machine Learning works: resources and assets
Query-based document summarization
Article • 08/31/2023

This guide shows how to perform document summarization by using the Azure OpenAI
GPT-3 model. It describes concepts that are related to the document summarization
process, approaches to the process, and recommendations on which model to use for
specific use cases. Finally, it presents two use cases, together with sample code snippets,
to help you understand key concepts.

Architecture
The following diagram shows how a user query fetches relevant data. The summarizer
uses GPT-3 to generate a summary of the text of the most relevant document. In this
architecture, the GPT-3 endpoint is used to summarize the text.

Download a PowerPoint file of this architecture.

Workflow
This workflow occurs in near-real time.

1. A user sends a query. For example, an employee of a manufacturing company
searches for specific information about a machine part on the company portal. The
query is first processed by an intent recognizer like conversational language
understanding. The relevant entities or concepts in the user query are used to
select and present a subset of documents from a knowledge base that's populated
offline (in this case, the company's knowledge base database). The output is fed
into a search and analysis engine like Azure Elastic Search , which filters the
relevant documents to return a document set of hundreds instead of thousands or
tens of thousands.
2. The user query is applied again on a search endpoint like Azure Cognitive Search
to rank the retrieved document set in order of relevance (page ranking). The
highest-ranked document is selected.
3. The selected document is scanned for relevant sentences. This scanning process
uses either a coarse method, like extracting all sentences that contain the user
query, or a more sophisticated method, like GPT-3 embeddings, to find
semantically similar material in the document.
4. After the relevant text is extracted, the GPT-3 Completions endpoint with the
summarizer summarizes the extracted content. In this example, the summary of
important details about the part that the employee specified in the query is
returned.

This article focuses on the summarizer component of the architecture.

Scenario details
Enterprises frequently create and maintain a knowledge base about business processes,
customers, products, and information. However, returning relevant content based on a
user query of a large dataset is often challenging. The user can query the knowledge
base and find an applicable document by using methods like page rank, but delving
further into the document to search for relevant information typically becomes a manual
task that takes time. However, with recent advances in foundation transformer models
like the one developed by OpenAI, the query mechanism has been refined by semantic
search methods that use encoding information like embeddings to find relevant
information. These developments make it possible to summarize content and present it
to the user in a concise way.

Document summarization is the process of creating summaries from large volumes of
data while maintaining significant informational elements and content value. This article
demonstrates how to use Azure OpenAI Service GPT-3 capabilities for your specific
use case. GPT-3 is a powerful tool that you can use for a range of natural language
processing tasks, including language translation, chatbots, text summarization, and
content creation. The methods and architecture described here are customizable and
can be applied to many datasets.

Potential use cases

Document summarization applies to any organizational domain that requires users to
search large amounts of reference data and generate a summary that concisely
describes relevant information. Typical domains include legal, financial, news, healthcare,
and academic organizations. Potential use cases of summarization are:

Generating summaries to highlight key insights about news, financial reporting, and so on.
Creating a quick reference to support an argument, for example, in legal proceedings.
Providing context for a paper's thesis, as in academic settings.
Writing literature reviews.
Annotating a bibliography.

Some benefits of using a summarization service for any use case are:

Reduced reading time.
More effective searching of large volumes of disparate data.
Reduced chance of bias from human summarization techniques. (This benefit depends on how unbiased the training data is.)
Enabling employees and users to focus on more in-depth analysis.

In-context learning
Azure OpenAI Service uses a generative completion model. The model uses natural
language instructions to identify the requested task and the skill required, a process
known as prompt engineering. When you use this approach, the first part of the prompt
includes natural language instructions and/or examples of the desired task. The model
completes the task by predicting the most probable next text. This technique is known
as in-context learning.

With in-context learning, language models can learn tasks from just a few examples. The
language model is provided with a prompt that contains a list of input-output pairs that
demonstrate a task, and then with a test input. The model makes a prediction by
conditioning on the prompt and predicting the next tokens.

There are three main approaches to in-context learning: zero-shot learning, few-shot
learning, and fine-tuning methods that change and improve the output. These
approaches vary based on the amount of task-specific data that's provided to the
model.

Zero-shot: In this approach, no examples are provided to the model. Only the task
request is provided as input. In zero-shot learning, the model depends on previously
trained concepts. It responds based only on data that it's trained on. It doesn't
necessarily understand the semantic meaning, but it has a statistical understanding that's
based on everything that it's learned from the internet about what should be generated
next. The model attempts to relate the given task to existing categories that it has
already learned about and responds accordingly.

Few-shot: In this approach, several examples that demonstrate the expected answer
format and content are included in the call prompt. The model is provided with a very
small training dataset to guide its predictions. Training with a small set of examples
enables the model to generalize and understand unrelated but previously unseen tasks.
Creating few-shot examples can be challenging because you need to accurately
articulate the task that you want the model to perform. One commonly observed
problem is that models, especially smaller ones, are sensitive to the writing style that's
used in the training examples.
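To make the few-shot idea concrete, here's a minimal sketch of how a few-shot summarization prompt could be assembled before it's sent to the completions endpoint. The example pairs and text are invented for illustration.

Python

# Sketch: assemble a few-shot prompt from invented input/output pairs, then
# append the new text to be summarized. The model infers the task from the
# demonstrated pattern.
examples = [
    ("The council approved a budget that raises library funding by 5%.",
     "Council budget increases library funding by 5%."),
    ("The committee voted to delay the zoning decision until next quarter.",
     "Zoning decision delayed to next quarter."),
]

new_text = "The agency published new guidelines requiring annual safety audits."

prompt = "Summarize the text in one sentence.\n\n"
for text, summary in examples:
    prompt += f"Text: {text}\nSummary: {summary}\n\n"
prompt += f"Text: {new_text}\nSummary:"

print(prompt)  # Pass this string as the prompt to the completions endpoint.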

Fine-tuning: Fine-tuning is a process of tailoring models to your own datasets. In this
customization step, you can improve the process by:

Including a larger set of data (at least 500 examples).
Using traditional optimization techniques with backpropagation to readjust the weights of the model. These techniques enable higher quality results than the zero-shot or few-shot approaches provide by themselves.
Improving the few-shot approach by training the model weights with specific prompts and a specific structure. This technique enables you to achieve better results on a wider number of tasks without needing to provide examples in the prompt. The result is less text sent and fewer tokens.

When you create a GPT-3 solution, the main effort is in the design and content of the
training prompt.

Prompt engineering
Prompt engineering is a natural language processing discipline that involves discovering
inputs that yield desirable or useful outputs. When a user prompts the system, the way
the content is expressed can dramatically change the output. Prompt design is the most
significant process for ensuring that the GPT-3 model provides a desirable and
contextual response.

The architecture described in this article uses the completions endpoint for
summarization. The completions endpoint is an Azure Cognitive Services API that
accepts a partial prompt or context as input and returns one or more outputs that
continue or complete the input text. A user provides input text as a prompt, and the
model generates text that attempts to match the context or pattern that's provided.
Prompt design is highly dependent on the task and data. Incorporating prompt
engineering into a fine-tuning dataset and investigating what works best before using
the system in production requires significant time and effort.

Prompt design
GPT-3 models can perform multiple tasks, so you need to be explicit in the goals of the
design. The models estimate the desired output based on the provided prompt.

For example, if you input the words "Give me a list of cat breeds," the model doesn't
automatically assume that you're asking for a list of cat breeds. You could be asking the
model to continue a conversation in which the first words are "Give me a list of cat
breeds" and the next ones are "and I'll tell you which ones I like." If the model just
assumed that you wanted a list of cats, it wouldn't be as good at content creation,
classification, or other tasks.

As described in Learn how to generate or manipulate text, there are three basic
guidelines for creating prompts:

Show and tell. Improve the clarity about what you want by providing instructions,
examples, or a combination of the two. If you want the model to rank a list of
items in alphabetical order or to classify a paragraph by sentiment, show it that
that's what you want.
Provide quality data. If you're building a classifier or want a model to follow a
pattern, be sure to provide enough examples. You should also proofread your
examples. The model can usually recognize spelling mistakes and return a
response, but it might assume misspellings are intentional, which can affect the
response.
Check your settings. The temperature and top_p settings control how
deterministic the model is in generating a response. If you ask it for a response
that has only one right answer, configure these settings at a lower level. If you
want more diverse responses, you might want to configure the settings at a higher
level. A common error is to assume that these settings are "cleverness" or
"creativity" controls.

Alternatives
Azure conversational language understanding is an alternative to the summarizer used
here. The main purpose of conversational language understanding is to build models
that predict the overall intention of an incoming utterance, extract valuable information
from it, and produce a response that aligns with the topic. It's useful in chatbot
applications when it can refer to an existing knowledge base to find the suggestion that
best corresponds to the incoming utterance. It doesn't help much when the input text
doesn't require a response. The intent in this architecture is to generate a short
summary of long textual content. The essence of the content is described in a concise
manner and all important information is represented.

Example scenarios

Use case: Summarizing legal documents


In this use case, a collection of legislative bills passed through Congress is summarized.
The summary is fine-tuned to bring it closer to a human-generated summary, which is
referred to as the ground truth summary.

Zero-shot prompt engineering is used to summarize the bills. The prompt and settings
are then modified to generate different summary outputs.

Dataset

The first dataset is the BillSum dataset for summarization of US Congressional and
California state bills. This example uses only the Congressional bills. The data is split into
18,949 bills to use for training and 3,269 bills to use for testing. BillSum focuses on mid-
length legislation that's between 5,000 and 20,000 characters long. It's cleaned and
preprocessed.

For more information about the dataset and instructions for download, see FiscalNote /
BillSum .

BillSum schema

The schema of the BillSum dataset includes:

bill_id: An identifier for the bill.
text: The bill text.
summary: A human-written summary of the bill.
title: The bill title.
text_len: The character length of the bill.
sum_len: The character length of the bill summary.
In this use case, the text and summary elements are used.
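As a hypothetical illustration of working with this schema, the sketch below loads a BillSum split with pandas and keeps the two fields used here. The file name is a placeholder for whichever JSON Lines split you downloaded, and the prompt/completion renaming simply mirrors the column names used later in the fine-tuning example.

Python

# Sketch: load a BillSum split and keep the fields used in this example.
# The file name is a placeholder for the JSON Lines file you downloaded.
import pandas as pd

df = pd.read_json("<path-to-billsum-split>.jsonl", lines=True)

# Keep only the bill text and its human-written summary.
df_prompt_completion = df[["text", "summary"]].rename(
    columns={"text": "prompt", "summary": "completion"}
)
print(df_prompt_completion.head())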
Zero-shot
The goal here is to teach the GPT-3 model to learn conversation-style input. The
completions endpoint is used to create an Azure OpenAI API and a prompt that
generates the best summary of the bill. It's important to create the prompts carefully so
that they extract relevant information. To extract general summaries from a given bill,
the following format is used.

Prefix: What you want it to do.
Context primer: The context.
Context: The information needed to provide a response. In this case, the text to summarize.
Suffix: The intended form of the answer. For example, an answer, a completion, or a summary.

Python

import openai

# normalize_text, bill_title_1, bill_text_1, and TEXT_DAVINCI_001 are defined
# elsewhere in the sample notebook.
API_KEY = "<YOUR_API_KEY>"  # Set your own API key here
RESOURCE_ENDPOINT = "<YOUR_RESOURCE_ENDPOINT>"  # Set a link to your resource endpoint

openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2022-12-01-preview"

prompt_i = 'Summarize the legislative bill given the title and the text.\n\nTitle:\n'+" ".join([normalize_text(bill_title_1)])+'\n\nText:\n'+" ".join([normalize_text(bill_text_1)])+'\n\nSummary:\n'

response = openai.Completion.create(
    engine=TEXT_DAVINCI_001,
    prompt=prompt_i,
    temperature=0.4,
    max_tokens=500,
    top_p=1.0,
    frequency_penalty=0.5,
    presence_penalty=0.5,
    stop=['\n\n###\n\n'],  # The ending token used during inference. Once it reaches this token, GPT-3 knows the completion is over.
    best_of=1
)

Original text: SAMPLE_BILL_1 .

Ground truth: National Science Education Tax Incentive for Businesses Act of 2007 -
Amends the Internal Revenue Code to allow a general business tax credit for
contributions of property or services to elementary and secondary schools and for
teacher training to promote instruction in science, technology, engineering, or
mathematics.
Zero-shot model summary: The National Science Education Tax Incentive for Businesses
Act of 2007 would create a new tax credit for businesses that make contributions to
science, technology, engineering, and mathematics (STEM) education at the elementary
and secondary school level. The credit would be equal to 100 percent of the qualified
STEM contributions of the taxpayer for the taxable year. Qualified STEM contributions
would include STEM school contributions, STEM teacher externship expenses, and STEM
teacher training expenses.

Observations: The zero-shot model generates a succinct, generalized summary of the
document. It's similar to the human-written ground truth and captures the same key
points. It's organized like a human-written summary and remains focused on the point.

Fine-tuning
Fine-tuning improves upon zero-shot learning by training on more examples than you
can include in the prompt, so you achieve better results on a wider number of tasks.
After a model is fine-tuned, you don't need to provide examples in the prompt. Fine-
tuning saves money by reducing the number of tokens required and enables lower-
latency requests.

At a high level, fine-tuning includes these steps:

Prepare and upload training data.
Train a new fine-tuned model.
Use the fine-tuned model.

For more information, see How to customize a model with Azure OpenAI Service.

Prepare data for fine-tuning

This step enables you to improve upon the zero-shot model by incorporating prompt
engineering into the prompts that are used for fine-tuning. Doing so helps give
directions to the model on how to approach the prompt/completion pairs. In a fine-tune
model, prompts provide a starting point that the model can learn from and use to make
predictions. This process enables the model to start with a basic understanding of the
data, which can then be improved upon gradually as the model is exposed to more data.
Additionally, prompts can help the model to identify patterns in the data that it might
otherwise miss.

The same prompt engineering structure is also used during inference, after the model is
finished training, so that the model recognizes the behavior that it learned during
training and can generate completions as instructed.
Python

# Adding variables used to design prompts consistently across all examples
# You can learn more here: https://learn.microsoft.com/azure/cognitive-services/openai/how-to/prepare-dataset

LINE_SEP = " \n "
PROMPT_END = " [end] "

# Injecting the zero-shot prompt into the fine-tune dataset
def stage_examples(proc_df):
    proc_df['prompt'] = proc_df.apply(lambda x: "Summarize the legislative bill. Do not make up facts.\n\nText:\n" + " ".join([normalize_text(x['prompt'])]) + '\n\nSummary:', axis=1)
    proc_df['completion'] = proc_df.apply(lambda x: " " + normalize_text(x['completion']) + PROMPT_END, axis=1)

    return proc_df

df_staged_full_train = stage_examples(df_prompt_completion_train)
df_staged_full_val = stage_examples(df_prompt_completion_val)

Now that the data is staged for fine-tuning in the proper format, you can start running
the fine-tune commands.

Next, you can use the OpenAI CLI to help with some of the data preparation steps. The
OpenAI tool validates data, provides suggestions, and reformats data.

Bash

openai tools fine_tunes.prepare_data -f data/billsum_v4_1/prompt_completion_staged_train.csv

openai tools fine_tunes.prepare_data -f data/billsum_v4_1/prompt_completion_staged_val.csv

Fine-tune the dataset

Python

payload = {
"model": "curie",
"training_file": " -- INSERT TRAINING FILE ID -- ",
"validation_file": "-- INSERT VALIDATION FILE ID --",
"hyperparams": {
"n_epochs": 1,
"batch_size": 200,
"learning_rate_multiplier": 0.1,
"prompt_loss_weight": 0.0001
}
}

url = RESOURCE_ENDPOINT + "openai/fine-tunes?api-version=2022-12-01-preview"


r = requests.post(url,
headers={
"api-key": API_KEY,
"Content-Type": "application/json"
},
json = payload
)
data = r.json()
print(data)
fine_tune_id = data['id']
print('Endpoint Called: {endpoint}'.format(endpoint = url))
print('Status Code: {status}'.format(status= r.status_code))
print('Fine tuning job ID: {id}'.format(id=fine_tune_id))
print('Response Information \n\n {text}'.format(text=r.text))

Evaluate the fine-tuned model

This section demonstrates how to evaluate the fine-tuned model.

Python

# Run this cell to check status

url = RESOURCE_ENDPOINT + "openai/fine-tunes/<--insert fine-tune id-->?api-version=2022-12-01-preview"
r = requests.get(url,
headers={
"api-key": API_KEY,
"Content-Type": "application/json"
}
)

data = r.json()
print('Endpoint Called: {endpoint}'.format(endpoint = url))
print('Status Code: {status}'.format(status= r.status_code))
print('Fine tuning ID: {id}'.format(id=fine_tune_id))
print('Status: {status}'.format(status = data['status']))
print('Response Information \n\n {text}'.format(text=r.text))

Original text: SAMPLE_BILL_1 .

Ground truth: National Science Education Tax Incentive for Businesses Act of 2007 -
Amends the Internal Revenue Code to allow a general business tax credit for
contributions of property or services to elementary and secondary schools and for
teacher training to promote instruction in science, technology, engineering, or
mathematics.
Fine-tuned model summary: This bill provides a tax credit for contributions to
elementary and secondary schools that benefit science, technology, engineering, and
mathematics education. The credit is equal to 100% of qualified STEM contributions
made by taxpayers during the taxable year. Qualified STEM contributions include: (1)
STEM school contributions, (2) STEM teacher externship expenses, and (3) STEM teacher
training expenses. The bill also provides a tax credit for contributions to elementary and
secondary schools that benefit science, technology, engineering, or mathematics
education. The credit is equal to 100% of qualified STEM service contributions made by
taxpayers during the taxable year. Qualified STEM service contributions include: (1)
STEM service contributions paid or incurred during the taxable year for services
provided in the United States or on a military base outside the United States; and (2)
STEM inventory property contributed during the taxable year which is used by an
educational organization located in the United States or on a military base outside the
United States in providing education in grades K-12 in science, technology, engineering
or mathematics.

For the results of summarizing a few more bills by using the zero-shot and fine-tune
approaches, see Results for BillSum Dataset .

Observations: Overall, the fine-tuned model does an excellent job of summarizing the
bill. It captures domain-specific jargon and the key points that are represented but not
explained in the human-written ground truth. It differentiates itself from the zero-shot
model by providing a more detailed and comprehensive summary.

Use case: Financial reports


In this use case, zero-shot prompt engineering is used to create summaries of financial
reports. A summary of summaries approach is then used to generate the results.

Summary of summaries approach


When you write prompts, the total of the GPT-3 prompt and the resulting completion
must be fewer than 4,000 tokens, so you're limited to a couple of pages of summary
text. For documents that typically contain more than 4,000 tokens (roughly 3,000 words),
you can use a summary of summaries approach. When you use this approach, the entire
text is first divided up to meet the token constraints. Summaries of the shorter texts are
then derived. In the next step, a summary of the summaries is created. This use case
demonstrates the summary of summaries approach with a zero-shot model. This
solution is useful for long documents. Additionally, this section describes how different
prompt engineering practices can vary the results.
Note

Fine-tuning is not applied in the financial use case because there's not enough data
available to complete that step.

Dataset
The dataset for this use case is technical and includes key quantitative metrics to assess
a company's performance.

The financial dataset includes:

url: The URL for the financial report.
pages: The page in the report that contains key information to be summarized (1-indexed).
completion: The ground truth summary of the report.
comments: Any additional information that's needed.

In this use case, Rathbone's financial report, from the dataset, will be summarized.
Rathbone's is an individual investment and wealth management company for private
clients. The report highlights Rathbone's performance in 2020 and mentions
performance metrics like profit, FUMA, and income. The key information to summarize is
on page 1 of the PDF.

Python

import os
import openai
# Using pdfminer.six to extract the text
# !pip install pdfminer.six
from pdfminer.high_level import extract_text

API_KEY = "<YOUR_API_KEY>"  # Set your own API key here
RESOURCE_ENDPOINT = "<YOUR_RESOURCE_ENDPOINT>"  # Set a link to your resource endpoint

openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2022-12-01-preview"

name = os.path.abspath(os.path.join(os.getcwd(), '---INSERT PATH OF LOCALLY DOWNLOADED RATHBONES_2020_PRELIM_RESULTS---')).replace('\\', '/')

pages_to_summarize = [0]
t = extract_text(name, page_numbers=pages_to_summarize)
print("Text extracted from " + name)
t
Zero-shot approach

When you use the zero-shot approach, you don't provide solved examples. You provide
only the command and the unsolved input. In this example, the Instruct model is used.
This model is specifically intended to take in an instruction and record an answer for it
without extra context, which is ideal for the zero-shot approach.

After you extract the text, you can use various prompts to see how they influence the
quality of the summary:

Python

# Using the text from the Rathbone's report, you can try different prompts to see how they affect the summary

prompt_i = 'Summarize the key financial information in the report using qualitative metrics.\n\nText:\n'+" ".join([normalize_text(t)])+'\n\nKey metrics:\n'

response = openai.Completion.create(
engine="davinci-instruct",
prompt=prompt_i,
temperature=0,
max_tokens=2048-int(len(prompt_i.split())*1.5),
top_p=1.0,
frequency_penalty=0.5,
presence_penalty=0.5,
best_of=1
)
print(response.choices[0].text)
>>>
- Funds under management and administration (FUMA) reached £54.7 billion at
31 December 2020, up 8.5% from £50.4 billion at 31 December 2019
- Operating income totalled £366.1 million, 5.2% ahead of the prior year
(2019: £348.1 million)
- Underlying1 profit before tax totalled £92.5 million, an increase of 4.3%
(2019: £88.7 million); underlying operating margin of 25.3% (2019: 25.5%)

# Different prompt

prompt_i = 'Extract most significant money related values of financial performance of the business like revenue, profit, etc. from the below text in about two hundred words.\n\nText:\n'+" ".join([normalize_text(t)])+'\n\nKey metrics:\n'

response = openai.Completion.create(
engine="davinci-instruct",
prompt=prompt_i,
temperature=0,
max_tokens=2048-int(len(prompt_i.split())*1.5),
top_p=1.0,
frequency_penalty=0.5,
presence_penalty=0.5,
best_of=1
)
print(response.choices[0].text)
>>>
- Funds under management and administration (FUMA) grew by 8.5% to reach
£54.7 billion at 31 December 2020
- Underlying profit before tax increased by 4.3% to £92.5 million,
delivering an underlying operating margin of 25.3%
- The board is announcing a final 2020 dividend of 47 pence per share, which
brings the total dividend to 72 pence per share, an increase of 2.9% over
2019

Challenges

As you can see, the model might produce metrics that aren't mentioned in the
original text.

Proposed solution: You can resolve this problem by changing the prompt.

The summary might focus on one section of the article and neglect other
important information.

Proposed solution: You can try a summary of summaries approach. Divide the
report into sections and create smaller summaries that you can then summarize to
create the output summary.

This code implements the proposed solutions:

Python

# Body of function

from pdfminer.high_level import extract_text

# splitter, num2words, normalize_text, and trim_incomplete are helper
# functions defined elsewhere in this use case (see the companion notebook).
text = extract_text(name, page_numbers=pages_to_summarize)

r = splitter(200, text)      # split the extracted text into smaller chunks

tok_l = int(2000/len(r))     # token budget per chunk summary
tok_l_w = num2words(tok_l)   # the same budget spelled out for the prompt

res_lis = []
# Stage 1: Summaries
for i in range(len(r)):
    prompt_i = f'Extract and summarize the key financial numbers and percentages mentioned in the Text in less than {tok_l_w} words.\n\nText:\n'+normalize_text(r[i])+'\n\nSummary in one paragraph:'
    response = openai.Completion.create(
        engine=TEXT_DAVINCI_001,
        prompt=prompt_i,
        temperature=0,
        max_tokens=tok_l,
        top_p=1.0,
        frequency_penalty=0.5,
        presence_penalty=0.5,
        best_of=1
    )
    t = trim_incomplete(response.choices[0].text)
    res_lis.append(t)

# Stage 2: Summary of summaries
prompt_i = 'Summarize the financial performance of the business like revenue, profit, etc. in less than one hundred words. Do not make up values that are not mentioned in the Text.\n\nText:\n'+" ".join([normalize_text(res) for res in res_lis])+'\n\nSummary:\n'

response = openai.Completion.create(
    engine=TEXT_DAVINCI_001,
    prompt=prompt_i,
    temperature=0,
    max_tokens=200,
    top_p=1.0,
    frequency_penalty=0.5,
    presence_penalty=0.5,
    best_of=1
)

print(trim_incomplete(response.choices[0].text))

The input prompt includes the original text from Rathbone's financial report for a
specific year.

Ground truth: Rathbones has reported revenue of £366.1m in 2020, up from £348.1m in
2019, and an increase in underlying profit before tax to £92.5m from £88.7m. Assets
under management rose 8.5% from £50.4bn to £54.7bn, with assets in wealth
management increasing 4.4% to £44.9bn. Net inflows were £2.1bn in 2020 compared
with £600m in the previous year, driven primarily by £1.5bn inflows into its funds
business and £400m due to the transfer of assets from Barclays Wealth.

Zero-shot summary of summaries output: Rathbones delivered a strong performance
in 2020, with funds under management and administration (FUMA) growing by 8.5% to
reach £54.7 billion at the end of the year. Underlying profit before tax increased by 4.3%
to £92.5 million, delivering an underlying operating margin of 25.3%. Total net inflows
across the group were £2.1 billion, representing a growth rate of 4.2%. Profit before tax
for the year was £43.8 million, with basic earnings per share totalling 49.6p. Operating
income for the year was 5.2% ahead of the prior year, totalling £366.1 million.

Observations: The summary of summaries approach produces a more detailed and
comprehensive summary that resolves the challenges encountered initially. It captures
the domain-specific jargon and the key points, which are represented in the ground
truth but not explained well.

The zero-shot model works well for summarizing mainstream documents. If the data is
industry-specific or topic-specific, contains industry-specific jargon, or requires industry-
specific knowledge, fine-tuning performs best. For example, this approach works well for
medical journals, legal forms, and financial statements. You can use the few-shot
approach instead of zero-shot to provide the model with examples of how to formulate
a summary, so it can learn to mimic the summary provided. For the zero-shot approach,
this solution doesn't retrain the model. The model's knowledge is based on the GPT-3
training. GPT-3 is trained with almost all available data from the internet. It performs
well for tasks that don't require specific knowledge.

For the results of using the zero-shot summary of summaries approach on a few reports
in the financial dataset, see Results for Summary of Summaries .

Recommendations
There are many ways to approach summarization by using GPT-3, including zero-shot,
few-shot, and fine-tuning. The approaches produce summaries of varying quality. You
can explore which approach produces the best results for your intended use case.

Based on observations from the testing presented in this article, here are a few
recommendations:

Zero-shot is best for mainstream documents that don't require specific domain
knowledge. This approach attempts to capture all high-level information in a
succinct, human-like manner and provides a high-quality baseline summary. Zero-
shot creates a high-quality summary for the legal dataset that's used in the tests in
this article.
Few-shot is difficult to use for summarizing long documents because the token
limitation is exceeded when an example text is provided. You can instead use a
zero-shot summary of summaries approach for long documents or increase the
dataset to enable successful fine-tuning. The summary of summaries approach
generates excellent results for the financial dataset that's used in these tests.
Fine-tuning is most useful for technical or domain-specific use cases when the
information isn't readily available. To achieve the best results with this approach,
you need a dataset that contains a couple thousand samples. Fine-tuning captures
the summary in a few templated ways, trying to conform to how the dataset
presents the summaries. For the legal dataset, this approach generates a higher
quality of summary than the one created by the zero-shot approach.

Evaluating summarization
There are multiple techniques for evaluating the performance of summarization models.

Here are a few:

ROUGE (Recall-Oriented Understudy for Gisting Evaluation). This technique includes
measures for automatically determining the quality of a summary by comparing it to
ideal summaries created by humans. The measures count the number of overlapping
units, like n-gram, word sequences, and word pairs, between the computer-generated
summary being evaluated and the ideal summaries.

Here's an example:

Python

# !pip install rouge
from rouge import Rouge

reference_summary = "The cat is on the porch by the tree"
generated_summary = "The cat is by the tree on the porch"
rouge = Rouge()
rouge.get_scores(generated_summary, reference_summary)
[{'rouge-1': {'r': 1.0, 'p': 1.0, 'f': 0.999999995},
  'rouge-2': {'r': 0.5714285714285714, 'p': 0.5, 'f': 0.5333333283555556},
  'rouge-l': {'r': 0.75, 'p': 0.75, 'f': 0.749999995}}]

BERTScore. This technique computes similarity scores by aligning generated and
reference summaries on a token level. Token alignments are computed greedily to
maximize the cosine similarity between contextualized token embeddings from BERT.

Here's an example:

Python

from torchmetrics.text.bert import BERTScore

preds = ["You should have ice cream in the summer"]
target = ["Ice creams are great when the weather is hot"]
bertscore = BERTScore()
score = bertscore(preds, target)
print(score)
Similarity matrix. A similarity matrix is a representation of the similarities between
different entities in a summarization evaluation. You can use it to compare different
summaries of the same text and measure their similarity. It's represented by a two-
dimensional grid, where each cell contains a measure of the similarity between two
summaries. You can measure the similarity by using various methods, like cosine
similarity, Jaccard similarity, and edit distance. You then use the matrix to compare the
summaries and determine which one is the most accurate representation of the original
text.
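For example, you can build a simple similarity matrix yourself. The following minimal sketch (an illustration, not part of the original evaluation toolkit) compares a few candidate summaries by using TF-IDF vectors and cosine similarity from scikit-learn:

Python

# A minimal sketch, assuming scikit-learn is installed (pip install scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

summaries = [
    "The cat is on the porch by the tree",
    "The cat is by the tree on the porch",
    "A dog sleeps in the yard",
]

# Each row and column corresponds to one summary; each cell holds the cosine
# similarity between the TF-IDF vectors of two summaries.
tfidf = TfidfVectorizer().fit_transform(summaries)
similarity_matrix = cosine_similarity(tfidf)
print(similarity_matrix)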

Here's a sample command that gets the similarity matrix of a BERTScore comparison of
two similar sentences:

Python

bert-score-show --lang en -r "The cat is on the porch by the tree" -c "The cat is by the tree on the porch" -f out.png

The first sentence, "The cat is on the porch by the tree", is passed with -r and is the
reference. The second sentence, passed with -c, is the candidate. The command uses
BERTScore to compare the sentences and generate a matrix.

The following matrix displays the output that's generated by the preceding command:

For more information, see SummEval: Reevaluating Summarization Evaluation . For a
PyPI toolkit for summarization, see summ-eval .

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Noa Ben-Efraim | Data & Applied Scientist

Other contributors:

Mick Alberts | Technical Writer


Rania Bayoumy | Senior Technical Program Manager
Harsha Viswanath | Principal Applied Science Manager
To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Azure OpenAI - Documentation, quickstarts, API reference
What are intents in LUIS?
Conversational language understanding
Jupyter Notebook with technical details and execution of this use case

Related resources
AI architecture design
Choose a Microsoft cognitive services technology
Natural language processing technology
Build language model pipelines with memory
Bing Web Search Azure Cache for Redis Azure Pipelines

Stay ahead of the competition by being informed and having a deep understanding of
your products and competitor products. An AI/machine learning pipeline helps you
quickly and efficiently gather, analyze, and summarize relevant information. This
architecture includes several powerful Azure OpenAI Service models. These models pair
with the popular open-source LangChain framework that's used to develop applications
that are powered by language models.

Note

Some parts in the introduction, components, and workflow of this article were
generated with the help of ChatGPT! Try it for yourself, or try it for your
enterprise.

Architecture

Download a PowerPoint file of this architecture.


Workflow
The batch pipeline stores internal company product information in a fast vector search
database. To achieve this result, the following steps are taken:

1. Internal company documents for products are imported and converted into
searchable vectors. Product-related documents are collected from departments,
such as sales, marketing, and product development. These documents are then
scanned and converted into text by using optical character recognition (OCR)
technology.
2. A LangChain chunking utility chunks the documents into smaller, more
manageable pieces. Chunking breaks down the text into meaningful phrases or
sentences that can be analyzed separately and improves the accuracy of the
pipeline's search capabilities.
3. The language model converts each chunk into a vectorized embedding.
Embeddings are a type of representation that capture the meaning and context of
the text. By converting each chunk into a vectorized embedding, you can store and
search for documents based on their meaning rather than their raw text. To
prevent loss of context within each document chunk, LangChain provides several
utilities for this text splitting step, like capabilities for sliding windows or specifying
text overlap. Some key features include utilities for tagging chunks with document
metadata, optimizing the document retrieval step, and downstream reference.
4. Create an index in a vector store database to store the raw document text,
embeddings vectors, and metadata. The resulting embeddings are stored in a
vector store database along with the raw text of the document and any relevant
metadata, such as the document's title and source.
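
The following sketch illustrates steps 2 through 4 with the classic LangChain (0.0.x) package layout. The deployment names, Redis URL, index name, and metadata are placeholders for this architecture, not values from it:

Python

# A minimal sketch, assuming the classic LangChain (0.0.x) package layout and
# an Azure OpenAI embeddings deployment. All names, keys, and URLs are placeholders.
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.redis import Redis

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "<your-azure-openai-endpoint>"
os.environ["OPENAI_API_KEY"] = "<your-azure-openai-key>"
os.environ["OPENAI_API_VERSION"] = "2023-05-15"

raw_text = "<OCR output for one product document>"

# Step 2: chunk the document into smaller, overlapping pieces.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = splitter.create_documents(
    [raw_text],
    metadatas=[{"title": "Product A", "source": "sales"}],
)

# Step 3: convert each chunk into a vectorized embedding.
# chunk_size limits the embedding batch size, which Azure OpenAI caps.
embeddings = OpenAIEmbeddings(deployment="<your-embeddings-deployment>", chunk_size=16)

# Step 4: index the chunks, embeddings, and metadata in Azure Cache for Redis.
vector_store = Redis.from_documents(
    docs,
    embeddings,
    redis_url="rediss://:<access-key>@<your-cache>.redis.cache.windows.net:6380",
    index_name="products",
)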

After the batch pipeline is complete, the real-time, asynchronous pipeline searches for
relevant information. The following steps are taken:

5. Enter a query and relevant metadata, such as your role in the company or the
business unit that you work in. An embeddings model then converts your query
into a vectorized embedding.
6. The orchestrator language model decomposes your query, or main task, into the
set of subtasks that are required to answer your query. Converting the main task
into a series of simpler subtasks allows the language model to address each task
more accurately, which results in better answers with less tendency for inaccuracy.
7. The resulting embedding and decomposed subtasks are stored in the LangChain
model's memory.
a. Top internal document chunks that are relevant to your query are retrieved from
your internal database. A fast vector search is performed for the top n similar
documents that are stored as vectors in Azure Cache for Redis.
b. In parallel, a web search for similar external products is performed via the
LangChain Bing Search language model plugin with a generated search query
that the orchestrator language model composes. Results are stored in the
external model memory component.
8. The vector store database is queried and returns the top relevant product
information pages (chunks and references). The system queries the vector store
database by using your query embedding and returns the most relevant product
information pages, along with the relevant text chunks and references. The
relevant information is stored in LangChain's model memory.
9. The system uses the information that’s stored in LangChain's model memory to
create a new prompt, which is sent to the orchestrator language model to build a
summary report that’s based on your query, company internal knowledge base,
and external web results.
10. Optionally, the output from the previous step is passed to a moderation filter to
remove unwanted information. The final competitive product report is passed to
you.
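
As an illustration of steps 5, 8, and 9, the following sketch retrieves the most similar chunks from the vector store built in the preceding sketch and asks an orchestrator model to draft the report. The Bing search, task decomposition, and moderation steps are omitted, and the deployment name and query are placeholders:

Python

# A minimal sketch of the retrieval and report-building steps, assuming the
# vector_store object and environment variables from the batch pipeline sketch.
from langchain.chat_models import AzureChatOpenAI
from langchain.schema import HumanMessage

query = "How does Product A compare to competitor widgets on battery life?"

# Step 8: fast vector search for the top n relevant product chunks.
top_chunks = vector_store.similarity_search(query, k=5)
context = "\n\n".join(doc.page_content for doc in top_chunks)

# Step 9: ask the orchestrator model to build the summary report from the context.
orchestrator = AzureChatOpenAI(deployment_name="<your-gpt-4-deployment>", temperature=0)
prompt = (
    "Using only the context below, write a competitive summary that answers the "
    f"question.\n\nQuestion: {query}\n\nContext:\n{context}"
)
report = orchestrator([HumanMessage(content=prompt)])
print(report.content)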

Components
Azure OpenAI Service provides REST API access to OpenAI's powerful language
models, including the GPT-3, GPT-3.5, GPT-4, and embeddings model series. You
can easily adapt these models to your specific task, such as content generation,
summarization, semantic search, converting text to semantically powerful
embeddings vectors, and natural-language-to-code translation.

LangChain is a third-party, open-source framework that you can use to develop


applications that are powered by language models. LangChain makes the
complexities of working and building with AI models easier by providing the
pipeline orchestration framework and helper utilities to run powerful, multiple-
model pipelines.

Memory refers to capturing information. By default, language modeling chains (or


pipelines) and agents operate in a stateless manner. They handle each incoming
query independently, just like the underlying language models and chat models
that they use. But in certain applications, such as chatbots, it's crucial to retain
information from past interactions in the short term and the long term. This area is
where the concept of "memory" comes into play. LangChain provides convenient
utility tools to manage and manipulate past chat messages. These utilities are
designed to be modular regardless of their specific usage. LangChain also offers
seamless methods to integrate these utilities into the memory of chains by using
language models. (A minimal usage sketch follows this list.)
Semantic Kernel is an open-source software development kit (SDK) that you can
use to orchestrate and deploy language models. You can explore Semantic Kernel
as a potential alternative to LangChain.
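
Here's the minimal memory usage sketch referenced above. It shows the general shape of LangChain's memory utilities; the wording of the exchange is illustrative:

Python

# A minimal sketch of LangChain's conversation memory utilities (classic 0.0.x API).
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()

# Persist one query/answer exchange so that later steps in the chain can reuse it.
memory.save_context(
    {"input": "How does Product A compare on battery life?"},
    {"output": "Product A lasts 12 hours; the closest competitor lasts 9 hours."},
)

# Later steps load the stored history back into the prompt context.
print(memory.load_memory_variables({}))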

Scenario details
This architecture uses an AI/machine learning pipeline, LangChain, and language models
to create a comprehensive analysis of how your product compares to similar competitor
products. The pipeline consists of two main components: a batch pipeline and a real-
time, asynchronous pipeline. When you send a query to the real-time pipeline, the
orchestrator language model, often GPT-4 or the most powerful available language
model, derives a set of tasks to answer your question. These subtasks invoke other
language models and APIs to mine the internal company product database and the
public internet to build a report that shows the competitive position of your products
versus the competitor products.

Potential use cases


You can apply this solution to the following scenarios:

Compare internal company product information from an internal knowledge base to
competitor product information that's retrieved from a Bing web search.
Perform a document search and information retrieval.
Create a chatbot for internal use that has an internal knowledge base and is also
enhanced by an external web search.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Brandon Cowen | Senior Specialized AI Cloud Solution Architect

Other contributor:

Ashish Chauhun | Senior Specialized AI Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.


Next steps
Azure Business Process Accelerator
Azure OpenAI
Azure OpenAI embeddings QnA
ChatGPT
Enterprise search with OpenAI architecture
Generative AI for developers: Exploring new tools and APIs in Azure OpenAI
Service
LangChain
Memory with language models
Quickstart: Get started generating text using Azure OpenAI Service
Redis on Azure OpenAI
Revolutionize your enterprise data with ChatGPT: Next-gen apps with Azure
OpenAI and Azure Cognitive Search
Semantic Kernel
Vector databases with Azure OpenAI

Related resources
AI architecture design
Batch processing
Types of language API services
Implement custom speech-to-text
Azure AI services Azure AI Speech Azure Machine Learning

This two-part guide describes various approaches for efficiently implementing high-
quality speech-aware applications. It focuses on extending and customizing the baseline
model of speech-to-text functionality that's provided by the Azure Cognitive Services
Speech service.

This article describes the problem space and decision-making process for designing
your solution. The second article, Deploy a custom speech-to-text solution, provides a
use case for applying these instructions and recommended practices.

The pre-built and custom AI spectrum


The pre-built and custom AI spectrum represents multiple AI model customization and
development effort tiers, ranging from ready-to-use pre-built models to fully
customized AI solutions.

On the left side of the spectrum, Azure Cognitive Services enables a quick and low-
friction implementation of AI capabilities into applications via pre-trained models.
Microsoft curates extensive datasets to train and build these baseline models. As a
result, you can use baseline models with no additional training data. They're consumed
via enhanced-security programmatic API calls.
Cognitive Services includes:

Speech. Speech-to-text, text-to-speech, speech translation, and speaker


recognition
Language. Entity recognition, sentiment analysis, question answering,
conversational language understanding, and translator
Vision. Computer vision and Face API
Decision. Anomaly detector, content moderator, and Personalizer
OpenAI Service. Advanced language models

When the pre-built baseline models don't perform accurately enough on your data, you
can customize them by adding training data that's relative to the problem domain. This
customization requires the extra effort of gathering adequate data to train and evaluate
an acceptable model. Cognitive Services that are customizable include Custom Vision,
Custom Translator, Custom Speech, and CLU. Extending pre-built Cognitive Services
models is in the center of the spectrum. Most of this article is focused on that central
area.

Alternatively, when models and training data focus on a specific scenario and require a
proprietary training dataset, Azure Machine Learning provides custom solution
resources, tools, compute, and workflow guidance to support building entirely custom
models. This scenario appears on the right side of the spectrum. These models are built
from scratch. Developing a model by using Azure Machine Learning typically ranges
from using visual tools like AutoML to programmatically developing the model by using
notebooks.

Azure Speech service


Azure Speech service unifies speech-to-text, text-to-speech, speech translation, voice
assistant, and speaker recognition functionality into a single subscription that's based on
Cognitive Services. You can enable an application for speech by integrating with Speech
service via easy-to-use SDKs and APIs.

The Azure speech-to-text service analyzes audio in real time or asynchronously to


transcribe the spoken word into text. Out of the box, Azure speech-to-text uses a
Universal Language Model as a baseline that reflects commonly used spoken language.
This baseline model is pre-trained with dialects and phonetics that represent a variety of
common domains. As a result, consuming the baseline model requires no extra
configuration and works well in most scenarios.
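
For example, a baseline transcription with the Python Speech SDK needs only a key, a region, and an audio file. The key, region, and file name in this sketch are placeholders:

Python

# A minimal sketch of baseline speech-to-text with the Python Speech SDK
# (pip install azure-cognitiveservices-speech). Key, region, and file are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<your-speech-key>", region="<your-region>")
audio_config = speechsdk.audio.AudioConfig(filename="commentary.wav")

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)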

Note, however, that the baseline model might not be sufficient if the audio contains
ambient noise or includes a lot of industry and domain-specific jargon. In these cases,
building a custom speech model makes sense. You do that by training with additional
data that's associated with the specific domain.

Depending on the size of the custom domain, it might also make sense to train multiple
models and compartmentalize a model for an individual application. For example,
Olympics commentators report on various sports, each with its own jargon. Because
each sport has a vocabulary that differs significantly from the others, building a custom
model specific to a sport increases accuracy by limiting the utterance data relative to
that particular sport. As a result, the model can learn from a precise and targeted set of
data.

So there are three approaches to implementing Azure speech-to-text:

The baseline model is appropriate when the audio is clear of ambient noise and
the transcribed speech consists of commonly spoken language.
A custom model augments the baseline model to include domain-specific
vocabulary that's shared across all areas of the custom domain.
Multiple custom models make sense when the custom domain has numerous
areas, each with a specific vocabulary.

Potential use cases


Here are some generic scenarios and use cases in which custom speech-to-text is
helpful:

Speech transcription for a specific domain, like medical transcription or call center
transcription
Live transcription, as in an app or to provide captions for live video streaming

Microsoft SDKs and open-source tools


When you're working with speech-to-text, you might find these resources helpful:

Azure Speech SDK


Speech Studio
FFMpeg / SOX

Design considerations
This section describes some design considerations for building a speech-based
application.

Baseline model vs. custom model


Azure Speech includes baseline models that support various languages. These models
are pre-trained with a vast amount of vocabulary and domains. However, you might
have a specialized vocabulary that needs recognition. In these situations, baseline
models might fall short. The best way to determine if the base model will suffice is to
analyze the transcription that's produced from the baseline model and compare it to a
human-generated transcript for the same audio. The deployment article in this guide
describes using Speech Studio to compare the transcripts and obtain a word error rate
(WER) score. If there are multiple incorrect word substitutions in the results, we
recommend that you train a custom model to recognize those words.
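
Speech Studio computes the WER for you, but you can also get a quick local estimate with an open-source package such as jiwer. This is an assumption for illustration, not part of the Speech tooling, and the transcripts are illustrative:

Python

# A minimal sketch, assuming the open-source jiwer package (pip install jiwer).
from jiwer import wer

human_transcript = "the great katja seizinger of germany"
model_transcript = "the great catch a sizing are of germany"

# WER = (substitutions + insertions + deletions) / number of reference words.
print(f"WER: {wer(human_transcript, model_transcript):.2%}")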

One vs. many custom models


If your scenario will benefit from a custom model, you next need to determine how
many models to build. One model is typically sufficient if the utterances are closely
related to one area or domain. However, multiple models are best if the vocabulary is
significantly different across the domain areas. In this scenario, you also need a variety
of training data.

Let's return to the Olympics example. Say you need to include the transcription of audio
commentary for multiple sports, including ice hockey, luge, snowboarding, alpine skiing,
and more. Building a custom speech model for each sport will improve accuracy
because each sport has unique terminology. However, each model must have diverse
training data. It's too restrictive and inextensible to create a model for each
commentator for each sport. A more practical approach is to build a single model for
each sport but include audio from a group that includes commentators with different
accents, of both genders, and of various ages. All domain-specific phrases related to the
sport as captured by the diverse commentators reside in the same model.
You also need to consider which languages and locales to support. It might make sense
to create these models by locale.

Acoustic and language model adaptation


Azure Speech provides three options for training a custom model:

Language model adaptation is the most commonly used customization. A language


model helps to train how certain words are used together in a particular context or a
specific domain. Building a language model is also relatively easy and fast. First, train the
model by supplying a variety of utterances and phrases for the particular domain. For
example, if the goal is to generate transcription for alpine skiing, collect human-
generated transcripts of multiple skiing events. Clean and combine them to create one
training data file with about 50 thousand phrases and sentences. For more details about
the data requirements for custom language model training, see Training and testing
datasets.

Pronunciation model customization is also one of the most commonly used


customizations. A pronunciation model helps the custom model recognize uncommon
words that don't have a standard pronunciation. For example, some of the terminology
in alpine skiing borrows from other languages, like the terms schuss and mogul. These
words are excellent candidates for training with a pronunciation dataset. For more
details about improving recognition by using a pronunciation file, see Pronunciation
data for training. For details about building a custom model by using Speech Studio, see
What is Custom Speech?.

Acoustic model adaptation provides phonetic training on the pronunciation of certain


words so that Azure Speech can properly recognize them. To build an acoustic model,
you need audio samples and accompanying human-generated transcripts. If the
recognition language matches common locales, like en-US, using the current baseline
model should be sufficient. Baseline models have diverse training that uses the voices of
native and non-native English speakers to cover a vast amount of English vocabulary.
Therefore, building an acoustic model adaptation on the en-US base model might not
provide much improvement. Training a custom acoustic model also takes a bit more
time. For more information about the data requirements for custom acoustic training,
see Training and testing datasets.

The final custom model can include datasets that use a combination of all three of the
customizations described in this section.

Training a custom model


There are two approaches to training a custom model:

Train with numerous examples of phrases and utterances from the domain. For
example, include transcripts of cleaned and normalized alpine skiing event audio
and human-generated transcripts of previous events. Be sure that the transcripts
include the terms used in alpine skiing and multiple examples of how
commentators pronounce them. If you follow this process, the resulting custom
model should be able to recognize domain-specific words and phrases.

Train with specific data that focuses on problem areas. This approach works well
when there isn't much training data, for example, if new slang terms are used
during alpine skiing events and need to be included in the model. This type of
training uses the following approach:
Use Speech Studio to generate a transcription and compare it with human-
generated transcriptions.
Identify problem areas from patterns in what the commentators say. Identify:
The contexts within which the problem word or utterance is applied.
Different inflections and pronunciations of the word or utterance.
Any unique commentator-specific applications of the word or utterance.

Training a custom model with specific data can be time-consuming. Steps include
carefully analyzing the transcription gaps, manually adding training phrases, and
repeating this process multiple times. However, in the end, this approach provides
focused training for the problem areas that were previously incorrectly transcribed. And
it's possible to iteratively build this model by selectively training on critical areas and
then proceeding down the list in order of importance. Another benefit is that the
dataset size will include a few hundred utterances rather than a few thousand, even after
many iterations of building the training data.

After you build your model


After you build your model, keep the following recommendations in mind:

Be aware of the difference between lexical text and display text. Speech Studio
produces WER based on lexical text. However, what the user sees is the display text
with punctuation, capitalization, and numerical words represented as numbers.
Following is an example of lexical text versus display text.

Lexical text: the speed is great and the time is even better fifty seven oh six three
seconds for the german

Display text: The speed is great. And that time is even better. 57063 seconds for
the German.
What's expected (implied) is: The speed is great. And that time is even better.
57.063 seconds for the German

The custom model might have a low WER, but that doesn't mean that the user-perceived
error rate (errors in display text) is low. This problem occurs mainly in alphanumeric
input because different applications can have alternative ways of representing the
input. You shouldn't rely only on the WER. You also need to review the final
recognition result.

When display text seems wrong, review the detailed recognition result from the
SDK, which includes lexical text, in which everything is spelled out. If the lexical text
is correct, the recognition is accurate. You can then resolve inaccuracies in the
display text (the final recognized result) by adding post-processing rules. (A minimal
sketch of such a rule follows this list.)

Manage datasets, models, and their versions. In Speech Studio, when you create
projects, datasets, and models, there are only two fields: name and description.
When you build datasets and models iteratively, you need to follow a good
naming and versioning scheme to make it easy to identify the contents of a
dataset and which model reflects which version of the dataset. For more details
about this recommendation, see Deploy a custom speech-to-text solution.
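
Here's the post-processing sketch referenced above. It encodes one hypothetical rule for this commentary domain: a five-digit number followed by "seconds" is a race time, so a decimal point is inserted before the last three digits. The rule and sample text are illustrative only:

Python

import re

def format_race_times(display_text: str) -> str:
    # Hypothetical domain rule: "57063 seconds" was spoken as "fifty seven oh
    # six three seconds", so rewrite it as "57.063 seconds".
    return re.sub(r"\b(\d{2})(\d{3})(?=\s+seconds)", r"\1.\2", display_text)

print(format_race_times("And that time is even better. 57063 seconds for the German."))
# And that time is even better. 57.063 seconds for the German.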

Go to part two of this guide: deployment

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Pratyush Mishra | Principal Engineering Manager

Other contributors:

Mick Alberts | Technical Writer


Rania Bayoumy | Senior Technical Program Manager

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
What is Custom Speech?
What is text-to-speech?
Train a Custom Speech model
Deploy a custom speech-to-text solution

Related resources
Artificial intelligence (AI) architecture design
Use a speech-to-text transcription pipeline to analyze recorded conversations
Control IoT devices with a voice assistant app
Deploy a custom speech-to-text
solution
Azure AI services Azure AI Speech Azure Machine Learning

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .

This article is an implementation guide and example scenario that provides a sample
deployment of the solution that's described in Implement custom speech-to-text:

Go to part one of this guide

Architecture

Download a Visio file of this architecture.


Workflow
1. Collect existing transcripts to use to train a custom speech model.
2. If the transcripts are in WebVTT or SRT format, clean the files so that they include
only the text portions of the transcripts.
3. Normalize the text by removing any punctuation, separating repeated words, and
spelling out any large numerical values. You can combine multiple cleaned-up
transcripts into one to create one dataset. Similarly, create a dataset for testing.
4. After the datasets are ready, upload them by using Speech Studio. Alternatively, if
the dataset is in a blob store, you can use the Azure Speech-to-text API and the
Speech CLI. In the API and the CLI, you can pass the dataset's URI as an input to
create a dataset for model training and testing.
5. In Speech Studio or via the API or CLI, use the new dataset to train a custom
speech model.
6. Evaluate the newly trained model against the test dataset.
7. If the performance of the custom model meets your quality expectations, publish it
for use in speech transcription. Otherwise, use Speech Studio to review the word
error rate (WER) and specific error details and determine what additional data is
needed for training.
8. Use the APIs and CLI to help operationalize the model building, evaluation, and
deployment process.

Components
Azure Machine Learning is an enterprise-grade service for the end-to-end
machine learning lifecycle.
Azure Cognitive Services is a set of APIs, SDKs, and services that can help you
make your applications more intelligent, engaging, and discoverable.
Speech Studio is a set of UI-based tools for building and integrating features
from Cognitive Services Speech service into your applications. Here, it's one
alternative for training datasets. It's also used to review training results.
Speech-to-text REST API is an API that you can use to upload your own data,
test and train a custom model, compare accuracy between models, and deploy
a model to a custom endpoint. You can also use it to operationalize your model
creation, evaluation, and deployment.
Speech CLI is a command-line tool for using Speech service without having to
write any code. It provides another alternative for creating and training datasets
and for operationalizing your processes.

Scenario details
This article is based on the following fictional scenario:

Contoso, Ltd., is a broadcast media company that airs broadcasts and commentary on
Olympics events. As part of the broadcast agreement, Contoso provides event
transcription for accessibility and data mining.

Contoso wants to use the Azure Speech service to provide live subtitling and audio
transcription for Olympics events. Contoso employs female and male commentators
from around the world who speak with diverse accents. In addition, each individual sport
has specific terminology that can make transcription difficult. This article describes the
application development process for this scenario: providing subtitles for an application
that needs to deliver accurate event transcription.

Contoso already has these required prerequisite components in place:

Human-generated transcripts for previous Olympics events. The transcripts


represent commentaries from different sports and diverse commentators.
An Azure Cognitive Services resource. You can create one in the Azure portal .

Develop a custom speech-based application


A speech-based application uses the Azure Speech SDK to connect to the Azure Speech
service to generate text-based audio transcription. Speech service supports various
languages and two fluency modes: conversational and dictation. To develop a custom
speech-based application, you generally need to complete these steps:

1. Use Speech Studio, Azure Speech SDK, Speech CLI, or the REST API to generate
transcripts for spoken sentences and utterances.
2. Compare the generated transcript with the human-generated transcript.
3. If certain domain-specific words are transcribed incorrectly, consider creating a
custom speech model for that specific domain.
4. Review various options for creating custom models. Decide whether one or many
custom models will work better.
5. Collect training and testing data.
6. Ensure the data is in an acceptable format.
7. Train, test and evaluate, and deploy the model.
8. Use the custom model for transcription.
9. Operationalize the model building, evaluation, and deployment process.

Let's look more closely at these steps:

1. Use Speech Studio, Azure Speech SDK, Speech CLI, or the REST API to generate
transcripts for spoken sentences and utterances
Azure Speech provides SDKs, a CLI interface, and a REST API for generating transcripts
from audio files or directly from microphone input. If the content is in an audio file, it
needs to be in a supported format. In this scenario, Contoso has previous event
recordings (audio and video) in .avi files. Contoso can use tools like FFmpeg to extract
audio from the video files and save it in a format that's supported by the Azure Speech
SDK, like .wav.

In the following code, the standard PCM audio codec, pcm_s16le , is used to extract
audio in a single channel (mono) that has a sampling rate of 8 kHz.

ffmpeg.exe -i INPUT_FILE.avi -acodec pcm_s16le -ac 1 -ar 8000 OUTPUT_FILE.wav
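
Because Contoso has many recordings, a small script can apply the same conversion to every file. This sketch assumes a hypothetical recordings folder and that ffmpeg is on the path; the flags mirror the command shown above:

Python

# A minimal batch-conversion sketch; the folder name is a placeholder.
import pathlib
import subprocess

for avi in pathlib.Path("recordings").glob("*.avi"):
    wav = avi.with_suffix(".wav")
    subprocess.run(
        ["ffmpeg", "-i", str(avi), "-acodec", "pcm_s16le", "-ac", "1", "-ar", "8000", str(wav)],
        check=True,
    )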

2. Compare the generated transcript with the human-generated transcript

To perform the comparison, Contoso samples commentary audio from multiple sports
and uses Speech Studio to compare the human-generated transcript with the results
transcribed by Azure Speech service. The Contoso human-generated transcripts are in a
WebVTT format. To use these transcripts, Contoso cleans them up and generates a
simple .txt file that has normalized text without the timestamp information.
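
The cleanup itself can be scripted. The following sketch uses the open-source webvtt-py and num2words packages (assumptions for illustration, not the tools Contoso used) to drop timestamps, spell out numbers, and strip punctuation; the file name is a placeholder:

Python

# A minimal cleanup sketch (pip install webvtt-py num2words). Both packages are
# assumptions for illustration; the .vtt file name is a placeholder.
import re
import webvtt
from num2words import num2words

def vtt_to_training_text(vtt_path: str) -> str:
    lines = []
    for caption in webvtt.read(vtt_path):  # iterates captions, ignoring timestamps
        text = caption.text.lower()
        text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)  # spell out numbers
        text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation
        lines.append(re.sub(r"\s+", " ", text).strip())
    return "\n".join(lines)

print(vtt_to_training_text("alpine_skiing_event.vtt"))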

For information about using Speech Studio to create and evaluate a dataset, see
Training and testing datasets.

Speech Studio provides a side-by-side comparison of the human-generated transcript
and the transcripts produced from the models selected for comparison. Test results
include a WER for the models, as shown here:

Model                          Error rate    Insertion    Substitution    Deletion
Model 1: 20211030              14.69%        6 (2.84%)    22 (10.43%)     3 (1.42%)
Model 2: Olympics_Skiing_v6    6.16%         3 (1.42%)    8 (3.79%)       2 (0.95%)

For more information about WER, see Evaluate word error rate.

Based on these results, the custom model (Olympics_Skiing_v6) is better than the base
model (20211030) for the dataset.
Note the Insertion and Deletion rates, which indicate that the audio file is relatively
clean and has low background noise.

3. If certain domain-specific words are transcribed incorrectly, consider creating a


custom speech model for that specific domain

Based on the results in the preceding table, for the base model, Model 1: 20211030,
about 10 percent of the words are substituted. In Speech Studio, use the detailed
comparison feature to identify domain-specific words that are missed. The following
table shows one section of the comparison.

Human-generated transcript: olympic champion to go back to back in the downhill since
nineteen ninety eight the great katja seizinger of germany what ninety four and ninety
eight
Model 1: olympic champion to go back to back in the downhill since nineteen ninety
eight the great catch a sizing are of germany what ninety four and ninety eight
Model 2: olympic champion to go back to back in the downhill since nineteen ninety
eight the great katja seizinger of germany what ninety four and ninety eight

Human-generated transcript: she has dethroned the olympic champion goggia
Model 1: she has dethroned the olympic champion georgia
Model 2: she has dethroned the olympic champion goggia

Model 1 doesn't recognize domain-specific words like the names of the athletes "Katja
Seizinger" and "Goggia." However, when the custom model is trained with data that
includes the athletes' names and other domain-specific words and phrases, it's able to
learn and recognize them.

4. Review various options for creating custom models. Decide whether one or many
custom models will work better

By experimenting with various ways to build custom models, Contoso found that they
could achieve better accuracy by using language and pronunciation model
customization. (See the first article in this guide.) Contoso also noted minor
improvements when they included acoustic (original audio) data for building the custom
model. However, the benefits weren't significant enough to make it worth maintaining
and training for a custom acoustic model.

Contoso found that creating separate custom language models for each sport (one
model for alpine skiing, one model for luge, one model for snowboarding, and so on)
provided better recognition results. They also noted that creating separate acoustic
models based on the type of sport to augment the language models wasn't necessary.

5. Collect training and testing data

The Training and testing datasets article provides details about collecting the data
needed for training a custom model. Contoso collected transcripts for various Olympics
sports from diverse commentators and used language model adaptation to build one
model per sport type. However, they used one pronunciation file for all custom models
(one for each sport). Because the testing and training data are kept separate, after a
custom model was built, Contoso used event audio whose transcripts weren't included
in the training dataset for model evaluation.

6. Ensure the data is in an acceptable format

As described in Training and testing datasets, datasets that are used to create a custom
model or to test the model need to be in a specific format. Contoso's data is in WebVTT
files. They created some simple tools to produce text files that contain normalized text
for language model adaptation.

7. Train, test and evaluate, and deploy the model

New event recordings are used to further test and evaluate the trained model. It can
take a couple of iterations of testing and evaluation to fine-tune a model. Finally, when
the model generates transcripts that have acceptable error rates, it's deployed
(published) to be consumed from the SDK.

8. Use the custom model for transcription

After the custom model is deployed, you can use the following C# code to use the
model in the SDK for transcription:

C#

string endpoint = "Endpoint ID from Speech Studio";
string locale = "en-US";
SpeechConfig config = SpeechConfig.FromSubscription(subscriptionKey: speechKey, region: region);
SourceLanguageConfig sourceLanguageConfig = SourceLanguageConfig.FromLanguage(locale, endpoint);
var recognizer = new SpeechRecognizer(config, sourceLanguageConfig, audioInput);

Notes about the code:

endpoint is the endpoint ID of the custom model that's deployed in step 7.


subscriptionKey and region are the Azure Cognitive Services subscription key and

region. You can get these values from the Azure portal by going to the resource
group where the Cognitive Services resource was created and looking at its keys.

9. Operationalize the model building, evaluation, and deployment process

After the custom model is published, it needs to be evaluated regularly and updated if
new vocabulary is added. Your business might evolve, and you might need more custom
models to increase coverage for more domains. The Azure Speech team also releases
new base models, which are trained on more data, as they become available.
Automation can help you keep up with these changes. The next section of this article
provides more details about automating the preceding steps.

Deploy this scenario


For information about how to use scripting to streamline and automate the entire
process of creating datasets for training and testing, building and evaluating models,
and publishing new models as needed, see custom-speech-stt on GitHub .

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Pratyush Mishra | Principal Engineering Manager

Other contributors:

Mick Alberts | Technical Writer


Rania Bayoumy | Senior Technical Program Manager

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
What is Custom Speech?
What is text-to-speech?
Train a Custom Speech model
Implement custom speech-to-text
Azure/custom-speech-stt on GitHub
Related resources
Artificial intelligence (AI) architecture design
Use a speech-to-text transcription pipeline to analyze recorded conversations
Control IoT devices with a voice assistant app
Implement custom speech-to-text
Conversation summarization
Azure AI services

Most businesses provide customer service support to help customers with product
queries, troubleshooting, and maintaining or upgrading features or the product itself. To
provide a satisfactory resolution, customer support specialists need to respond quickly
with accurate information. OpenAI can help organizations with customer support in a
variety of ways.

This guide describes how to generate summaries of customer-agent interactions by
using the Azure OpenAI GPT-3 model. It contains an end-to-end sample architecture
that illustrates the key components involved in getting a summary of a text input. The
generation of the text input is outside the scope of this guide. The focus of this guide is
to describe the process of implementing the summarization of a set of sample agent-
customer conversations and analyze the outcomes of various approaches to
summarization.

Conversation scenarios
Self-service chatbots (fully automated). In this scenario, customers can interact
with a chatbot that's powered by GPT-3 and trained on industry-specific data. The
chatbot can understand customer questions and answer appropriately based on
responses learned from a knowledge base.
Chatbot with agent intervention (semi-automated). Questions posed by
customers are sometimes complex and necessitate human intervention. In such
cases, GPT-3 can provide a summary of the customer-chatbot conversation and
help the agent with quick searches for additional information from a large
knowledge base.
Summarizing transcripts (fully automated or semi-automated). In most customer
support centers, agents are required to summarize conversations for record
keeping, future follow-up, training, and other internal processes. GPT-3 can
provide automated or semi-automated summaries that capture salient details of
conversations for further use.

This guide focuses on the process for summarizing transcripts by using Azure OpenAI
GPT-3.

On average, it takes an agent 5 to 6 minutes to summarize a single agent-customer


conversation. Given the high volumes of requests service teams handle on any given
day, this additional task can overburden the team. OpenAI is a good way to help agents
with summarization-related activities. It can improve the efficiency of the customer
support process and provide better precision. Conversation summarization can be
applied to any customer support task that involves agent-customer interaction.

Conversation summarization service


Conversation summarization is suitable in scenarios where customer support
conversations follow a question-and-answer format.

Some benefits of using a summarization service are:

Increased efficiency: It allows customer service agents to quickly summarize


customer conversations, eliminating the need for long back-and-forth exchanges.
This efficiency helps to speed up the resolution of customer problems.
Improved customer service: Agents can use summaries of conversations in future
interactions to quickly find the information needed to accurately resolve customer
concerns.
Improved knowledge sharing: Conversation summarization can help customer
service teams share knowledge with each other quickly and effectively. It equips
customer service teams with better resolutions and helps them provide faster
support.

Architecture
A typical architecture for a conversation summarizer has three main stages: pre-
processing, summarization, and post-processing. If the input contains a verbal
conversation or any form of speech, the speech needs to be transcribed to text. For
more information, see Azure Speech-to-text service .

Here's a sample architecture:


Download a PowerPoint file of this architecture.

Workflow
1. Gather input data: Feed relevant input data into the pipeline. If the source is an
audio file, you need to convert it to text by using a speech-to-text service like Azure
speech to text.
2. Pre-process the data: Remove confidential information and any unimportant
conversation from the data.
3. Feed the data into the summarizer: Pass the data in a prompt via Azure OpenAI
APIs. In-context learning models include zero-shot, few-shot, or a custom model.
4. Generate a summary: The model generates a summary of the conversation.
5. Post-process the data: Apply a profanity filter and various validation checks to the
summary. Add sensitive or confidential data that was removed during the pre-
process step back into the summary.
6. Evaluate the results: Review and evaluate the results. This step can help you
identify areas where the model needs to be improved and find errors.

The following sections provide more details about the three main stages.

Pre-process
The goal of pre-processing is to ensure that the data provided to the summarizer service
is relevant and doesn't include sensitive or confidential information.
Here are some pre-processing steps that can help condition your raw data. You might
need to apply one or many steps, depending on the use case.

Remove personally identifiable information (PII). You can use the Conversational
PII API (preview) to remove PII from transcribed or written text. This example shows
the output after the API has removed PII (a related code sketch follows this list):

Document text: Parker Doe has repaid all of their loans as of


2020-04-25. Their SSN is 999-99-9999. To contact them, use
their phone number 555-555-0100. They are originally from
Brazil and have Brazilian CPF number 998.214.865-68
Redacted document text: ******* has repaid all of their
loans as of *******. Their SSN is *******. To contact
them, use their phone number *******. They are originally from
Brazil and have Brazilian CPF number 998.214.865-68

...Entity 'Parker Doe' with category 'Person' got redacted


...Entity '2020-04-25' with category 'DateTime' got redacted
...Entity '999-99-9999' with category 'USSocialSecurityNumber' got
redacted
...Entity '555-555-0100' with category 'PhoneNumber' got redacted

Remove extraneous information. Customer agents start conversations with casual


exchanges that don't include relevant information. A trigger can be added to a
conversation to identify the point where the concern or relevant question is first
addressed. Removing that exchange from the context can improve the accuracy of
the summarizer service because the model is then fine-tuned on the most relevant
information in the conversation. The Curie GPT-3 engine is a popular choice for
this task because it's trained extensively, via content from the internet, to identify
this type of casual conversation.

Remove excessively negative conversations. Conversations can also include


negative sentiments from unhappy customers. You can use Azure content-filtering
methods like Azure Content Moderator to remove conversations that contain
sensitive information from analysis. Alternatively, OpenAI offers a moderation
endpoint, a tool that you can use to check whether content complies with
OpenAI's content policies.
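
Here's the code sketch referenced above for the PII step. Because the Conversational PII API is in preview, this sketch uses the closely related document-level PII detection in the Azure Text Analytics SDK as an approximation; the endpoint and key are placeholders:

Python

# A minimal PII-redaction sketch (pip install azure-ai-textanalytics). This uses
# document-level PII detection as an approximation of the Conversational PII API.
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

client = TextAnalyticsClient(
    endpoint="<your-language-resource-endpoint>",
    credential=AzureKeyCredential("<your-language-resource-key>"),
)

documents = [
    "Parker Doe has repaid all of their loans as of 2020-04-25. "
    "Their SSN is 999-99-9999. To contact them, use their phone number 555-555-0100."
]

for doc in client.recognize_pii_entities(documents):
    if not doc.is_error:
        print(doc.redacted_text)  # PII entities are replaced with asterisks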

Summarizer
OpenAI's text-completion API endpoint is called the completions endpoint. To start the
text-completion process, it requires a prompt. Prompt engineering is a process used in
large language models. The first part of the prompt includes natural language
instructions and/or examples of the specific task requested (in this scenario,
summarization). Prompts allow developers to provide some context to the API, which
can help it generate more relevant and accurate text completions. The model then
completes the task by predicting the most probable next text. This technique is known
as in-context learning.
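
For example, a zero-shot summarization call to the completions endpoint looks like the following sketch. The endpoint, key, deployment name, and transcript are placeholders; the example scenario later in this guide compares this approach with few-shot and fine-tuned variants:

Python

# A minimal zero-shot sketch against the Azure OpenAI completions endpoint.
# The endpoint, key, deployment name, and transcript are placeholders.
import openai

openai.api_type = "azure"
openai.api_base = "<your-azure-openai-endpoint>"
openai.api_key = "<your-azure-openai-key>"
openai.api_version = "2022-12-01-preview"

transcript = "<cleaned agent-customer conversation>"
prompt = (
    "Summarize the following customer support conversation in two sentences."
    f"\n\nConversation:\n{transcript}\n\nSummary:\n"
)

response = openai.Completion.create(
    engine="<your-completions-deployment>",  # for example, a davinci-instruct deployment
    prompt=prompt,
    temperature=0,
    max_tokens=150,
)
print(response.choices[0].text.strip())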

Note

Extractive summarization attempts to identify and extract salient information from a


text and group it to produce a concise summary without understanding the
meaning or context.

Abstractive summarization rewrites a text by first creating an internal semantic


representation and then creating a summary by using natural language processing.
This process involves paraphrasing.

There are three main approaches for training models for in-context learning: zero-shot,
few-shot and fine-tuning. These approaches vary based on the amount of task-specific
data that's provided to the model.

Zero-shot: In this approach, no examples are provided to the model. The task
request is the only input. In zero-shot learning, the model relies on data that GPT-3
is already trained on (almost all available data from the internet). It attempts to
relate the given task to existing categories that it has already learned about and
responds accordingly.

Few-shot: When you use this approach, you include a small number of examples in
the prompt that demonstrate the expected answer format and the context. The
model is provided with a very small amount of training data, typically just a few
examples, to guide its predictions. Training with a small set of examples enables
the model to generalize and understand related but previously unseen tasks.
Creating these few-shot examples can be challenging because they need to clarify
the task you want the model to perform. One commonly observed problem is that
models, especially small ones, are sensitive to the writing style that's used in the
training examples.

The main advantages of this approach are a significant reduction in the need for
task-specific data and reduced potential to learn an excessively narrow distribution
from a large but narrow fine-tuning dataset.

With this approach, you can't update the weights of the pretrained model.

For more information, see Language Models are few-shot learners .


Fine-tuning: Fine-tuning is the process of tailoring models to get a specific desired
outcome from your own datasets. It involves retraining models on new data. For
more information, see Learn how to customize a model for your application.

You can use this customization step to improve your process by:
Including a larger set of example data.
Using traditional optimization techniques with backpropagation to readjust the
weights of the model. These techniques enable higher quality results than the
zero-shot or few-shot approaches provide by themselves.
Improving the few-shot learning approach by training the model weights with
specific prompts and a specific structure. This technique enables you to achieve
better results on a wider number of tasks without needing to provide examples
in the prompt. The result is less text sent and fewer tokens.

Disadvantages include the need for a large new dataset for every task, the
potential for poor generalization out of distribution, and the possibility of exploiting
spurious features of the training data, which can result in unfair comparisons with
human performance.

Creating a dataset for model customization is different from designing prompts for
use with the other models. Prompts for completion calls often use either detailed
instructions or few-shot learning techniques and consist of multiple examples. For
fine-tuning, we recommend that each training example consists of a single input
example and its desired output. You don't need to provide detailed instructions or
examples in the prompt.

As you increase the number of training examples, your results improve. We


recommend including at least 500 examples. It's typical to use between thousands
and hundreds of thousands of labeled examples. Testing indicates that each
doubling of the dataset size leads to a linear increase in model quality.

This guide demonstrates the curie-instruct/text-curie-001 and davinci-instruct/text-


davinci-001 engines. These engines are frequently updated. The version you use might
be different.

Post-process
We recommend that you check the validity of the results that you get from GPT-3.
Implement validity checks by using a programmatic approach or classifiers, depending
on the use case. Here are some critical checks:

Verify that no significant points are missed.
Check for factual inaccuracies.
Check for any bias introduced by the training data used on the model.
Verify that the model doesn't change text by adding new ideas or points. This
problem is known as hallucination.
Check for grammatical and spelling errors.
Use a content profanity filter like Content Moderator to ensure that no
inappropriate or irrelevant content is included.

Finally, reintroduce any vital information that was previously removed from the
summary, like confidential information.

In some cases, a summary of the conversation is also sent to the customer, along with
the original transcript. In these cases, post-processing involves appending the transcript
to the summary. It can also include adding lead-in sentences like "Please see the
summary below."
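
Here's a minimal post-processing sketch. It uses hypothetical sample strings and helper names and shows how you might append the transcript with a lead-in sentence and flag summary words that never appear in the transcript as possible hallucinations for manual review:

Python

transcript = "Customer: Hello. I have a question about the game pass.\nAgent: Hello. How are you doing today?"
summary = "Customer wants to know how long they can access games after they leave Game Pass."

def postprocess_summary(summary: str, transcript: str) -> str:
    """Append the original transcript to the summary with a lead-in sentence."""
    return summary.strip() + "\n\nPlease see the original transcript below:\n" + transcript.strip()

def possible_hallucinations(summary: str, transcript: str) -> set:
    """Return summary words that don't appear in the transcript (candidates for manual review)."""
    transcript_words = set(transcript.lower().split())
    return {word for word in summary.lower().split() if word not in transcript_words}

final_text = postprocess_summary(summary, transcript)
flagged = possible_hallucinations(summary, transcript)
if flagged:
    print("Review these terms before sending the summary:", flagged)

A word-level check like this is only a rough filter. Combine it with classifiers or a content filter for the checks listed earlier.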

Considerations
It's important to fine-tune your base models with an industry-specific training dataset
and change the size of available datasets. Fine-tuned models perform best when the
training data includes at least 1,000 data points and the ground truth (human-generated
summaries) used to train the models is of high quality.

The tradeoff is cost. The process of labeling and cleaning datasets can be expensive. To
ensure high-quality training data, you might need to manually inspect ground truth
summaries and rewrite low-quality summaries. Consider the following points about the
summarization stage:

Prompt engineering: When provided with little instruction, Davinci often performs
better than other models. To optimize results, experiment with different prompts
for different models.
Token size: A summarizer that's based on GPT-3 is limited to a total of 4,098
tokens, including the prompt and completion. To summarize larger passages,
separate the text into parts that conform to these constraints, summarize each part
individually, and then collect the results in a final summary (see the chunking
sketch after this list).
Garbage in, garbage out: Trained models are only as good as the training data that
you provide. Be sure that the ground truth summaries in the training data are well
suited to the information that you eventually want to summarize in your dialogs.
Stopping point: The model stops summarizing when it reaches a natural stopping
point or a stop sequence that you provide. Test this parameter to choose among
multiple summaries and to check whether summaries look incomplete.
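
Here's the chunking sketch mentioned in the token-size consideration. It's a minimal example that splits a transcript on line breaks and groups the lines into parts that stay under an approximate token budget. It assumes a rough four-characters-per-token heuristic; your tokenizer can differ.

Python

def chunk_transcript(transcript: str, max_tokens: int = 3000) -> list:
    """Split a transcript into chunks that stay under an approximate token budget."""
    chunks, current, current_tokens = [], [], 0
    for line in transcript.splitlines():
        line_tokens = max(1, len(line) // 4)  # rough heuristic: ~4 characters per token
        if current and current_tokens + line_tokens > max_tokens:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += line_tokens
    if current:
        chunks.append("\n".join(current))
    return chunks

You'd then call the summarizer once per chunk and pass the collected partial summaries through a final summarization call.
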
Example scenario: Summarizing transcripts in
call centers
This scenario demonstrates how the Azure OpenAI summarization feature can help
customer service agents with summarization tasks. It tests the zero-shot, few-shot, and
fine-tuning approaches and compares the results against human-generated summaries.

The dataset used in this scenario is a set of hypothetical conversations between
customers and agents in the Xbox customer support center about various Xbox
products and services. The hypothetical chat is labeled with Prompt. The human-written
abstractive summary is labeled with Completion.

Prompt

Customer: Question on XAIL

Agent: Hello! How can I help you today?

Customer: Hi, I have a question about the Accessibility insider ring

Agent: Okay. I can certainly assist you with that.

Customer: Do I need to sign up for the preview ring to join the accessibility league?

Agent: No. You can leave your console out of Xbox Preview rings and still join the
League. However, note that some experiences made available to you may require that
you join an Xbox Preview ring.

Customer: Okay. And I can just sign up for preview ring later yeah?

Agent: That is correct.

Customer: Sweet.

Completion

Customer wants to know if they need to sign up for preview rings to join Xbox
Accessibility Insider League. Agent responds that it is not mandatory, but that
some experiences may require it.

Ideal output. The goal is to create summaries that follow this format: "Customer said x.
Agent responded y." Another goal is to capture salient features of the dialog, like the
customer complaint, suggested resolution, and follow-up actions.
Here's an example of a customer support interaction, followed by a comprehensive
human-written summary of it:

Dialog

Customer: Hello. I have a question about the game pass.

Agent: Hello. How are you doing today?

Customer: I'm good.

Agent. I see that you need help with the Xbox Game Pass.

Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.

Agent: Once a game leaves the Xbox Game Pass catalog, you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off (or the best available
discounted price) to continue playing a game once it leaves the catalog.

Customer: Got it, thanks

Ground truth summary

Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.

Zero-shot
The zero-shot approach is useful when you don't have ample labeled training data. In
this case, there aren't enough ground truth summaries. It's important to design prompts
carefully to extract relevant information. The following format is used to extract general
summaries from customer-agent chats:

prefix = "Please provide a summary of the conversation below: "

suffix = "The summary is as follows: "

Here's a sample that shows how to run a zero-shot model:

Python
import openai
from rouge import Rouge

# Assumes Azure OpenAI credentials (openai.api_type, openai.api_base, openai.api_key,
# openai.api_version) are already configured, and that `test` is a pandas DataFrame
# with 'prompt' and 'completion' columns.
rouge = Rouge()

# Run zero-shot prediction for all engines of interest.
deploymentNames = ["curie-instruct", "davinci-instruct"]  # also known as text-davinci/text-instruct
for deployment in deploymentNames:
    url = openai.api_base + "openai/deployments/" + deployment + "/completions?api-version=2022-12-01-preview"
    response_list = []
    rouge_list = []
    print("calling…" + deployment)
    for i in range(len(test)):
        response_i = openai.Completion.create(
            engine=deployment,
            prompt=build_prompt(prefix, [test['prompt'][i]], suffix),
            temperature=0.0,
            max_tokens=400,
            top_p=1.0,
            frequency_penalty=0.5,
            presence_penalty=0.0,
            stop=["end"]  # We recommend that you adjust the stop sequence based on the dataset.
        )
        scores = rouge.get_scores(normalize_text(response_i['choices'][0]['text']), test['completion'][i])
        rouge_list.append(scores[0]['rouge-1']['f'])
        response_list.append(response_i)
    summary_list = [normalize_text(i['choices'][0]['text']) for i in response_list]
    test[deployment + "_zeroshotsummary"] = summary_list
    test[deployment + "_zeroshotroguescore"] = rouge_list

Results and observations


The zero-shot model's output is produced directly from the base model. In this case,
both Curie and Davinci summarize the dialog fairly well. The only noticeable difference is
that the Curie model provides a little less detail. Curie starts the summary with
"customer asks the agent about the Xbox game pass." The corresponding sentence in
Davinci is "customer asked how long they could access games after they leave the Xbox
game pass catalog."

Dialog

Customer: Hello. I have a question about the game pass.

Agent: Hello. How are you doing?

Customer: I'm good.

Agent: I see that you need help with the Xbox Game Pass.
Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.

Agent: Once a game leaves the Xbox game pass catalog you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off or the best available
discounted price to continue playing a game once it leaves the catalog.

Ground truth

Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.

Davinci result

The customer asked how long they could access games after they leave the Xbox game
pass catalog. The agent told them that they would need to purchase a digital copy of
the game from the Xbox app for windows or the Microsoft store in order to continue
playing it. The agent also reminded the customer that they would be notified prior to a
game leaving the Xbox game pass catalog. as a member the customer could purchase
any game in the catalog for up to 20 off or the best available discounted price.

Curie result

The customer asks the agent about the Xbox game pass. the agent tells the customer
that once a game leaves the Xbox game pass catalog the customer will need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game. The agent
also reminds the customer that Xbox will notify members prior to a game leaving the
Xbox game pass catalog.

Few-shot
When you use the few-shot approach, the model is provided with a small number of
examples.

context_primer = "Below are examples of conversations and their corresponding

summaries:"

prefix = "Please provide a summary of the conversation below: "


suffix = "The summary is as follows: "

Here's a sample that shows how to run a few-shot model:

Python

# Reuses rouge, test, prefix, suffix, and deploymentNames from the zero-shot sample.
train_small = train[:2]  # Select a handful of labeled examples for the prompt (the original selection isn't shown).
train_small_json = train_small.to_dict(orient='records')
compiled_train_prompt = build_prompt_fewshot(prefix, context_primer, train_small_json, suffix)

for deployment in deploymentNames:
    url = openai.api_base + "openai/deployments/" + deployment + "/completions?api-version=2022-12-01-preview"
    response_list = []
    rouge_list = []
    print("calling…" + deployment)
    for i in range(len(test)):
        response_i = openai.Completion.create(
            engine=deployment,
            prompt=compiled_train_prompt + build_prompt(prefix, [test['prompt'][i]], suffix),
            temperature=0.0,
            max_tokens=400,
            top_p=1.0,
            frequency_penalty=0.5,
            presence_penalty=0.0,
            stop=["end"]  # We recommend that you adjust the stop sequence based on the dataset.
        )
        scores = rouge.get_scores(normalize_text(response_i['choices'][0]['text']), test['completion'][i])
        rouge_list.append(scores[0]['rouge-1']['f'])
        response_list.append(response_i)
    summary_list = [normalize_text(i['choices'][0]['text']) for i in response_list]
    test[deployment + "_fewshot"] = summary_list
    test[deployment + "_FSscore1"] = rouge_list

Results and observations


With the few-shot approach, the summaries continue to capture salient features of the
conversation. The Davinci summary is more compact and closer to the ground truth.
Curie fabricates some trivial details.

Dialog

Customer: Hello. I have a question about the game pass.

Agent: Hello. How are you doing?


Customer: I'm good.

Agent: I see that you need help with the Xbox Game Pass.

Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.

Agent: Once a game leaves the Xbox Game Pass catalog, you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off or the best available
discounted price to continue playing a game once it leaves the catalog.

Ground truth

Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.

Davinci result

customer wanted to know how long they could access games after they leave game
pass. Agent informs that once a game leaves the Xbox game pass catalog the customer
would need to purchase a digital copy or obtain another form of entitlement to
continue playing the game.

Curie result

customer has a question about the game pass. customer is good. agent needs help with
the Xbox game pass. customer asks how long they can access the games after they
leave the game pass catalog. Agent informs that once a game leaves the Xbox game
pass catalog the customer will need to purchase a digital copy from the Xbox app for
windows or the Microsoft store play from a disc or obtain another form of entitlement
to continue playing the game. customer is happy to hear this and thanks agent.

Fine-tuning
Fine-tuning is the process of tailoring models to get a specific desired outcome from
your own datasets.

Here's an example format:


{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

Results and observations


Testing suggests that a fine-tuned Curie model leads to results that are comparable to
those of a Davinci few-shot model. Both summaries capture the customer's question
and the agent's answer without capturing the details about discounts and without
adding content. Both summaries are similar to the ground truth.

Dialog

Customer: Hello. I have a question about the game pass.

Agent: Hello. How are you doing?

Customer: I'm good.

Agent: I see that you need help with the Xbox Game Pass.

Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.

Agent: Once a game leaves the Xbox Game Pass catalog, you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off or the best available
discounted price to continue playing a game once it leaves the catalog.

Ground truth

Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.

Curie result

customer wants to know how long they can access the games after they leave game
pass. agent explains that once a game leaves the Xbox game pass catalog they'll need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game.
Conclusions
Generally, the Davinci model requires fewer instructions to perform tasks than other
models, such as Curie. Davinci is better suited for summarizing text that requires an
understanding of context or specific language. Because Davinci is the most complex
model, its latency is higher than that of other models. Curie is faster than Davinci and is
capable of summarizing conversations.

These tests suggest that you can generate better summaries when you provide more
instruction to the model via few-shot or fine-tuning. Fine-tuned models are better at
conforming to the structure and context learned from a training dataset. This capability
is especially useful when summaries are domain specific (for example, generating
summaries from a doctor's notes or online-prescription customer support). If you use
fine-tuning, you have more control over the types of summaries that you see.

For the sake of easy comparison, here's a summary of the results that are presented
earlier:

Ground truth

Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.

Davinci zero-shot result

The customer asked how long they could access games after they leave the Xbox game
pass catalog. The agent told them that they would need to purchase a digital copy of
the game from the Xbox app for windows or the Microsoft store in order to continue
playing it. The agent also reminded the customer that they would be notified prior to a
game leaving the Xbox game pass catalog. As a member the customer could purchase
any game in the catalog for up to 20 off or the best available discounted price.

Curie zero-shot result

The customer asks the agent about the Xbox game pass. the agent tells the customer
that once a game leaves the Xbox game pass catalog the customer will need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game. The agent
also reminds the customer that Xbox will notify members prior to a game leaving the
Xbox game pass catalog.

Davinci few-shot result


customer wanted to know how long they could access games after they leave game
pass. Agent informs that once a game leaves the Xbox game pass catalog the customer
would need to purchase a digital copy or obtain another form of entitlement to
continue playing the game.

Curie few-shot result

customer has a question about the game pass. customer is good. agent needs help with
the Xbox game pass. customer asks how long they can access the games after they
leave the game pass catalog. Agent informs that once a game leaves the Xbox game
pass catalog the customer will need to purchase a digital copy from the Xbox app for
windows or the Microsoft store play from a disc or obtain another form of entitlement
to continue playing the game. customer is happy to hear this and thanks agent.

Curie fine-tuning result

customer wants to know how long they can access the games after they leave game
pass. agent explains that once a game leaves the Xbox game pass catalog they'll need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game.

Evaluating summarization
There are multiple techniques for evaluating the performance of summarization models.

Here are a few:

ROUGE (Recall-Oriented Understudy for Gisting Evaluation). This technique includes
measures for automatically determining the quality of a summary by comparing it to
ideal summaries created by humans. The measures count the number of overlapping
units, like n-grams, word sequences, and word pairs, between the computer-generated
summary that's being evaluated and the ideal summaries.

Here's an example:

Python

from rouge import Rouge

reference_summary = "The cat is on the porch by the tree"
generated_summary = "The cat is by the tree on the porch"
rouge = Rouge()
rouge.get_scores(generated_summary, reference_summary)
# [{'rouge-1': {'r': 1.0, 'p': 1.0, 'f': 0.999999995},
#   'rouge-2': {'r': 0.5714285714285714, 'p': 0.5, 'f': 0.5333333283555556},
#   'rouge-l': {'r': 0.75, 'p': 0.75, 'f': 0.749999995}}]
BertScore. This technique computes similarity scores by aligning generated and
reference summaries on a token level. Token alignments are computed greedily to
maximize the cosine similarity between contextualized token embeddings from BERT.

Here's an example:

Python

from torchmetrics.text.bert import BERTScore

# preds and target are lists of sentences.
preds = ["You should have ice cream in the summer"]
target = ["Ice creams are great when the weather is hot"]
bertscore = BERTScore()
score = bertscore(preds, target)
print(score)

Similarity matrix. A similarity matrix is a representation of the similarities between
different entities in summarization evaluation. You can use it to compare different
summaries of the same text and measure their similarity. It's represented by a two-
dimensional grid, where each cell contains a measure of the similarity between two
summaries. You can measure the similarity by using a variety of methods, like cosine
similarity, Jaccard similarity, and edit distance. You then use the matrix to compare the
summaries and determine which one is the most accurate representation of the original
text.
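
For example, here's a minimal sketch (assuming scikit-learn is available) that computes a cosine-similarity matrix over TF-IDF vectors of several candidate summaries:

Python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

summaries = [
    "Customer wants to know how long they can access games after they leave Game Pass.",
    "The customer asked how long they could access games after they leave the catalog.",
    "The agent helped the customer with an Xbox Game Pass question.",
]

vectors = TfidfVectorizer().fit_transform(summaries)
similarity_matrix = cosine_similarity(vectors)  # cell [i][j] is the similarity between summaries i and j
print(similarity_matrix)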

Here's a sample command that generates the similarity matrix of a BERTScore
comparison of two similar sentences:

Bash

bert-score-show --lang en -r "The cat is on the porch by the tree" -c "The cat is by the tree on the porch" -f out.png

The first sentence, "The cat is on the porch by the tree," is referred to as the reference.
The second sentence is referred to as the candidate. The command uses BERTScore to
compare the sentences and generate a matrix.

The following matrix displays the output that's generated by the preceding command:

For more information, see SummEval: Reevaluating Summarization Evaluation. For a
PyPI toolkit for summarization, see summ-eval 0.892.

Responsible use
GPT can produce excellent results, but you need to check the output for social, ethical,
and legal biases and harmful results. When you fine-tune models, you need to remove
any data points that might be harmful for the model to learn. You can use red teaming
to identify any harmful outputs from the model. You can implement this process
manually and support it by using semi-automated methods. You can generate test cases
by using language models and then use a classifier to detect harmful behavior in the
test cases. Finally, you should perform a manual check of generated summaries to
ensure that they're ready to be used.

For more information, see Red Teaming Language Models with Language Models .
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Meghna Jani | Data & Applied Scientist II

Other contributor:

Mick Alberts | Technical Writer

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
More information about Azure OpenAI
ROUGE reference article
Training module: Introduction to Azure OpenAI Service
Learning path: Develop AI solutions with Azure OpenAI

Related resources
Query-based document summarization
Choose a Microsoft cognitive services technology
Natural language processing technology
Machine learning operations (MLOps) v2
Article • 08/31/2023

This article describes three Azure architectures for machine learning operations. They all
have end-to-end continuous integration (CI), continuous delivery (CD), and retraining
pipelines. The architectures are for these AI applications:

Classical machine learning


Computer vision (CV)
Natural language processing (NLP)

The architectures are the product of the MLOps v2 project. They incorporate the best
practices that the solution architects discovered in the process of creating multiple
machine learning solutions. The result is deployable, repeatable, and maintainable
patterns as described here.

All of the architectures use the Azure Machine Learning service.

For an implementation with sample deployment templates for MLOps v2, see Azure
MLOps (v2) solution accelerator on GitHub.

Potential use cases


Classical machine learning: Time-Series forecasting, regression, and classification
on tabular structured data are the most common use cases in this category.
Examples are:
Binary and multi-label classification
Linear, polynomial, ridge, lasso, quantile, and Bayesian regression
ARIMA, autoregressive (AR), SARIMA, VAR, SES, LSTM
CV: The MLOps framework presented here focuses mostly on the CV use cases of
segmentation and image classification.
NLP: This MLOps framework can implement the following use cases, as well as others
that aren't listed:
Named entity recognition
Text classification
Text generation
Sentiment analysis
Translation
Question answering
Summarization
Sentence detection
Language detection
Part-of-speech tagging

Simulations, deep reinforcement learning, and other forms of AI aren't covered by this
article.

Architecture
The MLOps v2 architectural pattern is made up of four main modular elements that
represent these phases of the MLOps lifecycle:

Data estate
Administration and setup
Model development (inner loop)
Model deployment (outer loop)

These elements, the relationships between them, and the personas typically associated
with them are common for all MLOps v2 scenario architectures. There can be variations
in the details of each, depending on the scenario.

The base architecture for MLOps v2 for Machine Learning is the classical machine
learning scenario on tabular data. The CV and NLP architectures build on and modify
this base architecture.

Current architectures
The architectures currently covered by MLOps v2 and discussed in this article are:

Classical machine learning architecture


Machine Learning CV architecture
Machine Learning NLP architecture

Classical machine learning architecture


Download a Visio file of this architecture.

Workflow for the classical machine learning architecture

1. Data estate

This element illustrates the data estate of the organization, and potential data
sources and targets for a data science project. Data engineers are the primary
owners of this element of the MLOps v2 lifecycle. The Azure data platforms in this
diagram are neither exhaustive nor prescriptive. The data sources and targets that
represent recommended best practices based on the customer use case are
indicated by a green check mark.

2. Administration and setup

This element is the first step in the MLOps v2 accelerator deployment. It consists of
all tasks related to creation and management of resources and roles associated
with the project. These can include the following tasks, and perhaps others:
a. Creation of project source code repositories
b. Creation of Machine Learning workspaces by using Bicep, ARM, or Terraform
c. Creation or modification of datasets and compute resources that are used for
model development and deployment
d. Definition of project team users, their roles, and access controls to other
resources
e. Creation of CI/CD pipelines
f. Creation of monitors for collection and notification of model and infrastructure
metrics

The primary persona associated with this phase is the infrastructure team, but
there can also be data engineers, machine learning engineers, and data scientists.
3. Model development (inner loop)

The inner loop element consists of your iterative data science workflow that acts
within a dedicated, secure Machine Learning workspace. A typical workflow is
illustrated in the diagram. It proceeds from data ingestion, exploratory data
analysis, experimentation, model development and evaluation, to registration of a
candidate model for production. This modular element as implemented in the
MLOps v2 accelerator is agnostic and adaptable to the process your data science
team uses to develop models.

Personas associated with this phase include data scientists and machine learning
engineers.

4. Machine Learning registries

After the data science team develops a model that's a candidate for deploying to
production, the model can be registered in the Machine Learning workspace
registry. CI pipelines that are triggered, either automatically by model registration
or by gated human-in-the-loop approval, promote the model and any other model
dependencies to the model deployment phase.

Personas associated with this stage are typically machine learning engineers.

5. Model deployment (outer loop)

The model deployment or outer loop phase consists of pre-production staging and testing, production deployment, and monitoring of model, data, and
infrastructure. CD pipelines manage the promotion of the model and related assets
through production, monitoring, and potential retraining, as criteria that are
appropriate to your organization and use case are satisfied.

Personas associated with this phase are primarily machine learning engineers.

6. Staging and test

The staging and test phase can vary with customer practices but typically includes
operations such as retraining and testing of the model candidate on production
data, test deployments for endpoint performance, data quality checks, unit testing,
and responsible AI checks for model and data bias. This phase takes place in one
or more dedicated, secure Machine Learning workspaces.

7. Production deployment

After a model passes the staging and test phase, it can be promoted to production
by using a human-in-the-loop gated approval. Model deployment options include
a managed batch endpoint for batch scenarios or, for online, near-real-time
scenarios, either a managed online endpoint or Kubernetes deployment by using
Azure Arc. Production typically takes place in one or more dedicated, secure
Machine Learning workspaces.

8. Monitoring

Monitoring in staging, test, and production makes it possible for you to collect
metrics for, and act on, changes in performance of the model, data, and
infrastructure. Model and data monitoring can include checking for model and
data drift, model performance on new data, and responsible AI issues.
Infrastructure monitoring can watch for slow endpoint response, inadequate
compute capacity, or network problems.

9. Data and model monitoring: events and actions

Based on criteria for model and data matters of concern such as metric thresholds
or schedules, automated triggers and notifications can implement appropriate
actions to take. This can be regularly scheduled automated retraining of the model
on newer production data and a loopback to staging and test for pre-production
evaluation. Or, it can be due to triggers on model or data issues that require a
loopback to the model development phase where data scientists can investigate
and potentially develop a new model.

10. Infrastructure monitoring: events and actions

Based on criteria for infrastructure matters of concern such as endpoint response lag or insufficient compute for the deployment, automated triggers and
notifications can implement appropriate actions to take. They trigger a loopback to
the setup and administration phase where the infrastructure team can investigate
and potentially reconfigure the compute and network resources.

Machine Learning CV architecture


Download a Visio file of this architecture.

Workflow for the CV architecture


The Machine Learning CV architecture is based on the classical machine learning
architecture, but it has modifications that are particular to supervised CV scenarios.

1. Data estate

This element illustrates the data estate of the organization and potential data
sources and targets for a data science project. Data engineers are the primary
owners of this element of the MLOps v2 lifecycle. The Azure data platforms in this
diagram are neither exhaustive nor prescriptive. Images for CV scenarios can come
from many different data sources. For efficiency when developing and deploying
CV models with Machine Learning, recommended Azure data sources for images
are Azure Blob Storage and Azure Data Lake Storage.

2. Administration and setup

This element is the first step in the MLOps v2 accelerator deployment. It consists of
all tasks related to creation and management of resources and roles associated
with the project. For CV scenarios, administration and setup of the MLOps v2
environment is largely the same as for classical machine learning, but with an
additional step: create image labeling and annotation projects by using the
labeling feature of Machine Learning or another tool.

3. Model development (inner loop)

The inner loop element consists of your iterative data science workflow performed
within a dedicated, secure Machine Learning workspace. The primary difference
between this workflow and the classical machine learning scenario is that image
labeling and annotation is a key element of this development loop.

4. Machine Learning registries

After the data science team develops a model that's a candidate for deploying to
production, the model can be registered in the Machine Learning workspace
registry. CI pipelines that are triggered either automatically by model registration
or by gated human-in-the-loop approval promote the model and any other model
dependencies to the model deployment phase.

5. Model deployment (outer loop)

The model deployment or outer loop phase consists of pre-production staging and testing, production deployment, and monitoring of model, data, and
infrastructure. CD pipelines manage the promotion of the model and related assets
through production, monitoring, and potential retraining as criteria appropriate to
your organization and use case are satisfied.

6. Staging and test

The staging and test phase can vary with customer practices but typically includes
operations such as test deployments for endpoint performance, data quality
checks, unit testing, and responsible AI checks for model and data bias. For CV
scenarios, retraining of the model candidate on production data can be omitted
due to resource and time constraints. Instead, the data science team can use
production data for model development, and the candidate model that's
registered from the development loop is the model that's evaluated for
production. This phase takes place in one or more dedicated, secure Machine
Learning workspaces.

7. Production deployment

After a model passes the staging and test phase, it can be promoted to production
via human-in-the-loop gated approvals. Model deployment options include a
managed batch endpoint for batch scenarios or, for online, near-real-time
scenarios, either a managed online endpoint or Kubernetes deployment by using
Azure Arc. Production typically takes place in one or more dedicated, secure
Machine Learning workspaces.

8. Monitoring

Monitoring in staging, test, and production makes it possible for you to collect
metrics for, and act on, changes in the performance of the model, data, and
infrastructure. Model and data monitoring can include checking for model
performance on new images. Infrastructure monitoring can watch for slow
endpoint response, inadequate compute capacity, or network problems.

9. Data and model monitoring: events and actions

The data and model monitoring and event and action phases of MLOps for CV
are the key differences from classical machine learning. Automated retraining is
typically not done in CV scenarios when model performance degradation on new
images is detected. In this case, new images for which the model performs poorly
must be reviewed and annotated by a human-in-the-loop process, and often the
next action goes back to the model development loop for updating the model
with the new images.

10. Infrastructure monitoring: events and actions

Based on criteria for infrastructure matters of concern such as endpoint response lag or insufficient compute for the deployment, automated triggers and
notifications can implement appropriate actions to take. This triggers a loopback
to the setup and administration phase where the infrastructure team can
investigate and potentially reconfigure environment, compute, and network
resources.

Machine Learning NLP architecture

Download a Visio file of this architecture.

Workflow for the NLP architecture


The Machine Learning NLP architecture is based on the classical machine learning
architecture, but it has some modifications that are particular to NLP scenarios.

1. Data estate

This element illustrates the organization data estate and potential data sources
and targets for a data science project. Data engineers are the primary owners of
this element of the MLOps v2 lifecycle. The Azure data platforms in this diagram
are neither exhaustive nor prescriptive. Data sources and targets that represent
recommended best practices based on the customer use case are indicated by a
green check mark.

2. Administration and setup

This element is the first step in the MLOps v2 accelerator deployment. It consists of
all tasks related to creation and management of resources and roles associated
with the project. For NLP scenarios, administration and setup of the MLOps v2
environment is largely the same as for classical machine learning, but with an
additional step: create text labeling and annotation projects by using the
labeling feature of Machine Learning or another tool.

3. Model development (inner loop)

The inner loop element consists of your iterative data science workflow performed
within a dedicated, secure Machine Learning workspace. The typical NLP model
development loop can be significantly different from the classical machine learning
scenario in that annotators for sentences and tokenization, normalization, and
embeddings for text data are the typical development steps for this scenario.

4. Machine Learning registries

After the data science team develops a model that's a candidate for deploying to
production, the model can be registered in the Machine Learning workspace
registry. CI pipelines that are triggered either automatically by model registration
or by gated human-in-the-loop approval promote the model and any other model
dependencies to the model deployment phase.

5. Model deployment (outer loop)

The model deployment or outer loop phase consists of pre-production staging and testing, production deployment, and monitoring of the model, data, and
infrastructure. CD pipelines manage the promotion of the model and related assets
through production, monitoring, and potential retraining, as criteria for your
organization and use case are satisfied.
6. Staging and test

The staging and test phase can vary with customer practices, but typically includes
operations such as retraining and testing of the model candidate on production
data, test deployments for endpoint performance, data quality checks, unit testing,
and responsible AI checks for model and data bias. This phase takes place in one
or more dedicated, secure Machine Learning workspaces.

7. Production deployment

After a model passes the staging and test phase, it can be promoted to production
by a human-in-the-loop gated approval. Model deployment options include a
managed batch endpoint for batch scenarios or, for online, near-real-time
scenarios, either a managed online endpoint or Kubernetes deployment by using
Azure Arc. Production typically takes place in one or more dedicated, secure
Machine Learning workspaces.

8. Monitoring

Monitoring in staging, test, and production makes it possible for you to collect and
act on changes in performance of the model, data, and infrastructure. Model and
data monitoring can include checking for model and data drift, model
performance on new text data, and responsible AI issues. Infrastructure monitoring
can watch for issues such as slow endpoint response, inadequate compute
capacity, and network problems.

9. Data and model monitoring: events and actions

As with the CV architecture, the data and model monitoring and event and action
phases of MLOps for NLP are the key differences from classical machine learning.
Automated retraining isn't typically done in NLP scenarios when model
performance degradation on new text is detected. In this case, new text data for
which the model performs poorly must be reviewed and annotated by a human-in-
the-loop process. Often the next action is to go back to the model development
loop to update the model with the new text data.

10. Infrastructure monitoring: events and actions

Based on criteria for infrastructure matters of concern such as endpoint response lag or insufficient compute for the deployment, automated triggers and
notifications can implement appropriate actions to take. They trigger a loopback to
the setup and administration phase where the infrastructure team can investigate
and potentially reconfigure the compute and network resources.
Components
Machine Learning : A cloud service for training, scoring, deploying, and
managing machine learning models at scale.
Azure Pipelines : This build and test system is based on Azure DevOps and is
used for the build and release pipelines. Azure Pipelines splits these pipelines into
logical steps called tasks.
GitHub : A code hosting platform for version control, collaboration, and CI/CD
workflows.
Azure Arc : A platform for managing Azure and on-premises resources by using
Azure Resource Manager. The resources can include virtual machines, Kubernetes
clusters, and databases.
Kubernetes : An open-source system for automating deployment, scaling, and
management of containerized applications.
Azure Data Lake : A Hadoop-compatible file system. It has an integrated
hierarchical namespace and the massive scale and economy of Blob Storage.
Azure Synapse Analytics : A limitless analytics service that brings together data
integration, enterprise data warehousing, and big data analytics.
Azure Event Hubs : A service that ingests and stores streaming data generated by
client applications, preserving the sequence of events received. Consumers can
connect to the hub endpoints to retrieve messages for processing. This architecture
takes advantage of the integration with Data Lake Storage.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Scott Donohoo | Senior Cloud Solution Architect


Moritz Steller | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
What is Azure Pipelines?
Azure Arc overview
What is Azure Machine Learning?
Data in Azure Machine Learning
Azure MLOps (v2) solution accelerator
End-to-end machine learning operations (MLOps) with Azure Machine Learning
Introduction to Azure Data Lake Storage Gen2
Azure DevOps documentation
GitHub Docs
Azure Synapse Analytics documentation
Azure Event Hubs documentation

Related resources
Choose a Microsoft cognitive services technology
Natural language processing technology
Compare the machine learning products and technologies from Microsoft
How Azure Machine Learning works: resources and assets (v2)
What are Azure Machine Learning pipelines?
Machine learning operations (MLOps) framework to upscale machine learning
lifecycle with Azure Machine Learning
What is the Team Data Science Process?
MLOps for Python models using
Azure Machine Learning
Azure Blob Storage Azure Container Registry Azure DevOps Azure Machine Learning Azure Pipelines

This reference architecture shows how to implement continuous integration (CI), continuous delivery (CD), and a retraining pipeline for an AI application using Azure
DevOps and Azure Machine Learning. The solution is built on the scikit-learn diabetes
dataset but can be easily adapted for any AI scenario and other popular build systems
such as Jenkins or Travis.

A reference implementation for this architecture is available on GitHub .

Architecture

Download a Visio file of this architecture.


Workflow
This architecture consists of the following services:

Azure Pipelines. This build and test system is based on Azure DevOps and used for the
build and release pipelines. Azure Pipelines breaks these pipelines into logical steps
called tasks. For example, the Azure CLI task makes it easier to work with Azure
resources.

Azure Machine Learning is a cloud service for training, scoring, deploying, and
managing machine learning models at scale. This architecture uses the Azure Machine
Learning Python SDK to create a workspace, compute resources, the machine learning
pipeline, and the scoring image. An Azure Machine Learning workspace provides the
space in which to experiment, train, and deploy machine learning models.
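
For example, here's a minimal sketch (assuming the Azure Machine Learning Python SDK v1 and placeholder subscription, resource group, and workspace names) of creating a workspace:

Python

from azureml.core import Workspace

# Placeholder values; replace with your own subscription and resource group.
ws = Workspace.create(
    name="mlops-demo-ws",
    subscription_id="<subscription-id>",
    resource_group="mlops-demo-rg",
    create_resource_group=True,
    location="eastus",
)
ws.write_config()  # Save workspace details for later Workspace.from_config() calls.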

Azure Machine Learning Compute is an on-demand cluster of virtual machines with automatic scaling and GPU and CPU node options. The training job is executed on this
cluster.
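
Here's a minimal sketch (SDK v1, hypothetical cluster name and VM size, reusing the ws object from the previous sketch) of provisioning such a cluster with autoscaling down to zero nodes:

Python

from azureml.core.compute import AmlCompute, ComputeTarget

compute_config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_DS3_V2",  # CPU nodes; pick a GPU size for deep learning workloads
    min_nodes=0,                # scale to zero when idle to avoid compute charges
    max_nodes=4,
)
compute_target = ComputeTarget.create(ws, "cpu-cluster", compute_config)
compute_target.wait_for_completion(show_output=True)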

Azure Machine Learning pipelines provide reusable machine learning workflows that
can be reused across scenarios. Training, model evaluation, model registration, and
image creation occur in distinct steps within these pipelines for this use case. The
pipeline is published or updated at the end of the build phase and gets triggered on
new data arrival.
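
Here's a minimal sketch (SDK v1, hypothetical script names, reusing the ws and compute_target objects from the previous sketches) of a pipeline with distinct steps that's published as a REST endpoint at the end of the build phase:

Python

from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

train_step = PythonScriptStep(
    name="train_model",
    script_name="train.py",       # hypothetical training script
    source_directory="src",
    compute_target=compute_target,
)
evaluate_step = PythonScriptStep(
    name="evaluate_model",
    script_name="evaluate.py",    # hypothetical evaluation script
    source_directory="src",
    compute_target=compute_target,
)
evaluate_step.run_after(train_step)

pipeline = Pipeline(workspace=ws, steps=[train_step, evaluate_step])
published_pipeline = pipeline.publish(name="retraining-pipeline", description="Train and evaluate the model")
print(published_pipeline.endpoint)  # REST endpoint used to trigger retraining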

Azure Blob Storage. Blob containers are used to store the logs from the scoring service.
In this case, both the input data and the model prediction are collected. After some
transformation, these logs can be used for model retraining.

Azure Container Registry. The scoring Python script is packaged as a Docker image and
versioned in the registry.

Azure Container Instances. As part of the release pipeline, the QA and staging
environment is mimicked by deploying the scoring webservice image to Container
Instances, which provides an easy, serverless way to run a container.

Azure Kubernetes Service. Once the scoring webservice image is thoroughly tested in
the QA environment, it is deployed to the production environment on a managed
Kubernetes cluster.

Azure Application Insights. This monitoring service is used to detect performance anomalies.

MLOps Pipeline
This solution demonstrates end-to-end automation of various stages of an AI project
using tools that are already familiar to software engineers. The machine learning
problem is simple to keep the focus on the DevOps pipeline. The solution uses the
scikit-learn diabetes dataset and builds a ridge linear regression model to predict the
likelihood of diabetes.

This solution is based on the following three pipelines:

Build pipeline. Builds the code and runs a suite of tests.


Retraining pipeline. Retrains the model on a schedule or when new data becomes
available.
Release pipeline. Operationalizes the scoring image and promotes it safely across
different environments.

The next sections describe each of these pipelines.

Build pipeline
The CI pipeline gets triggered every time code is checked in. It publishes an updated
Azure Machine Learning pipeline after building the code and running a suite of tests.
The build pipeline consists of the following tasks:

Code quality. These tests ensure that the code conforms to the standards of the
team.

Unit test. These tests make sure the code works, has adequate code coverage, and
is stable.

Data test. These tests verify that the data samples conform to the expected
schema and distribution. Customize this test for other use cases and run it as a
separate data sanity pipeline that gets triggered as new data arrives. For example,
move the data test task to a data ingestion pipeline so you can test it earlier.

Note

You should consider enabling DevOps practices for the data used to train the
machine learning models, but this is not covered in this article. For more
information about the architecture and best practices for CI/CD of a data ingestion
pipeline, see DevOps for a data ingestion pipeline.
The following one-time tasks occur when setting up the infrastructure for Azure
Machine Learning and the Python SDK:

Create the workspace that hosts all Azure Machine Learning-related resources.
Create the compute resources that run the training job.
Create the machine learning pipeline with the updated training script.
Publish the machine learning pipeline as a REST endpoint to orchestrate the
training workflow. The next section describes this step.

Retraining pipeline
The machine learning pipeline orchestrates the process of retraining the model in an
asynchronous manner. Retraining can be triggered on a schedule or when new data
becomes available by calling the published pipeline REST endpoint from the previous
step.
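
Here's a minimal sketch of triggering the published pipeline's REST endpoint (assuming the published_pipeline object from the earlier sketch; a schedule or data-arrival event would call the same endpoint):

Python

import requests
from azureml.core.authentication import InteractiveLoginAuthentication

auth_header = InteractiveLoginAuthentication().get_authentication_header()
response = requests.post(
    published_pipeline.endpoint,
    headers=auth_header,
    json={"ExperimentName": "retraining-run"},  # hypothetical experiment name
)
print(response.status_code, response.json().get("Id"))  # run ID returned by the endpoint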

This pipeline covers the following steps:

Train model. The training Python script is executed on the Azure Machine Learning
Compute resource to get a new model file which is stored in the run history. Since
training is the most compute-intensive task in an AI project, the solution uses
Azure Machine Learning Compute.

Evaluate model. A simple evaluation test compares the new model with the
existing model. Only when the new model is better does it get promoted.
Otherwise, the model is not registered and the pipeline is canceled.

Register model. The retrained model is registered with the Azure ML Model
registry. This service provides version control for the models along with metadata
tags so they can be easily reproduced.
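
For the register-model step, here's a minimal sketch (SDK v1, hypothetical model path and tags, reusing the ws object from earlier):

Python

from azureml.core.model import Model

model = Model.register(
    workspace=ws,
    model_path="outputs/sklearn_regression_model.pkl",  # hypothetical path produced by the training step
    model_name="diabetes-ridge-model",
    tags={"framework": "scikit-learn", "dataset": "diabetes"},
)
print(model.name, model.version)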

Release pipeline
This pipeline shows how to operationalize the scoring image and promote it safely
across different environments. This pipeline is subdivided into two environments, QA
and production:

QA environment
Model artifact trigger. Release pipelines get triggered every time a new artifact is
available. A new model registered to Azure Machine Learning Model Management
is treated as a release artifact. In this case, a pipeline is triggered for each new
model that's registered.

Create a scoring image. The registered model is packaged together with a scoring
script and Python dependencies (Conda YAML file ) into an operationalization
Docker image. The image automatically gets versioned through Azure Container
Registry.

Deploy on Container Instances. This service is used to create a non-production environment. The scoring image is also deployed here, and this is mostly used for
testing. Container Instances provides an easy and quick way to test the Docker
image.

Test web service. A simple API test makes sure the image is successfully deployed.

Production environment
Deploy on Azure Kubernetes Service. This service is used for deploying a scoring
image as a web service at scale in a production environment.

Test web service. A simple API test makes sure the image is successfully deployed.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Scalability
A build pipeline on Azure DevOps can be scaled for applications of any size. Build
pipelines have a maximum timeout that varies depending on the agent they are run on.
Builds can run forever on self-hosted agents (private agents). For Microsoft-hosted
agents for a public project, builds can run for six hours. For private projects, the limit is
30 minutes.

To use the maximum timeout, set the following property in your Azure Pipelines YAML
file:

YAML

jobs:
- job: <job_name>
timeoutInMinutes: 0

Ideally, have your build pipeline finish quickly and execute only unit tests and a subset
of other tests. This allows you to validate the changes quickly and fix them if issues arise.
Run long-running tests during off-hours.

The release pipeline publishes a real-time scoring web service. A release to the QA
environment is done using Container Instances for convenience, but you can use
another Kubernetes cluster running in the QA/staging environment.

Scale the production environment according to the size of your Azure Kubernetes
Service cluster. The size of the cluster depends on the load you expect for the deployed
scoring web service. For real-time scoring architectures, throughput is a key
optimization metric. For non-deep learning scenarios, the CPU should be sufficient to
handle the load; however, for deep learning workloads, when speed is a bottleneck,
GPUs generally provide better performance compared to CPUs. Azure Kubernetes
Service supports both CPU and GPU node types, which is the reason this solution uses it
for image deployment. For more information, see GPUs vs CPUs for deployment of deep
learning models.

Scale the retraining pipeline up and down depending on the number of nodes in your
Azure Machine Learning Compute resource, and use the autoscaling option to manage
the cluster. This architecture uses CPUs. For deep learning workloads, GPUs are a better
choice and are supported by Azure Machine Learning Compute.

Management
Monitor retraining job. Machine learning pipelines orchestrate retraining across a
cluster of machines and provide an easy way to monitor them. Use the Azure
Machine Learning UI and look under the pipelines section for the logs.
Alternatively, these logs are also written to blob and can be read from there as well
using tools such as Azure Storage Explorer .

Logging. Azure Machine Learning provides an easy way to log at each step of the
machine learning life cycle. The logs are stored in a blob container. For more
information, see Enable logging in Azure Machine Learning. For richer monitoring,
configure Application Insights to use the logs.

Security. All secrets and credentials are stored in Azure Key Vault and accessed in
Azure Pipelines using variable groups.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

Azure DevOps is free for open-source projects and small projects with up to five
users. For larger teams, purchase a plan based on the number of users.

Compute is the biggest cost driver in this architecture and its cost varies depending on
the use case. This architecture uses Azure Machine Learning Compute, but other options
are available. Azure Machine Learning does not add any surcharge on top of the cost of
the virtual machines backing your compute cluster. Configure your compute cluster to
have a minimum of 0 nodes, so that when not in use, it can scale down to 0 nodes and
not incur any costs. The compute cost depends on the node type, the number of nodes,
and the provisioning mode (low-priority or dedicated). You can estimate the cost for
Machine Learning and other services using the Azure pricing calculator .

Deploy this scenario


To deploy this reference architecture, follow the steps described in the Getting Started
guide in the GitHub repo .

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Praneet Singh Solanki | Senior Software Engineer

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Want to learn more? Check out the related learning path, Start the machine
learning lifecycle with MLOps.
Secure MLOps solutions with
Azure network security
Azure DevOps Azure DNS Azure Machine Learning Azure Private Link Azure Virtual Network

Machine Learning DevOps (MLOps), first highlighted in Hidden Technical Debt in Machine Learning Systems in 2015, is growing fast. The market for MLOps is expected
to reach $4 billion by 2025. In the meantime, working to secure MLOps solutions is
becoming more important.

This article describes how to help protect MLOps solutions by using Azure network
security capabilities such as Azure Virtual Network, network peering, Azure Private Link,
and Azure DNS. It also introduces how to use:

Azure Pipelines to access resources in the virtual network


The required configurations of Azure Container Registry and Azure Machine
Learning compute instances and clusters in a virtual network.

Finally, this article describes the costs of using the network security services.

Architecture
[Architecture diagram: The AML VNET (10.1.0.0/16) contains the Azure Machine Learning
workspace, compute instance and compute cluster, an Azure Kubernetes Service cluster,
private DNS zones, and private endpoints to Azure Blob Storage, Azure Container
Registry, and Azure Key Vault. The BASTION VNET (10.2.0.0/16) contains Azure Bastion, a
jump host, and a self-hosted Azure Pipelines agent. The two virtual networks are
connected through virtual network peering, and the resource group also includes Azure
VPN Gateway and Azure Monitor.]

Download a Visio file of this architecture.

Dataflow
The architecture diagram shows a sample MLOps solution.

The virtual network named AML VNET helps protect the Azure Machine Learning
workspace and its associated resources.

The jump host, Azure Bastion, and self-hosted agents belong to another virtual
network named BASTION VNET. This arrangement simulates having another
solution that requires access to the resources within AML VNET.

With the support of virtual network peering and private DNS zones, Azure
Pipelines can execute on self-host agents and trigger the Azure Machine Learning
pipelines that are published in the Azure Machine Learning workspace to train,
evaluate, and register the machine learning models.

Finally, the model is deployed to online endpoints or batch endpoints that are
supported by Azure Machine Learning compute or Azure Kubernetes Service
clusters.
Components
The sample MLOps solution consists of these components:

Data storage: Azure Blob Storage for data storage.


Model training, validation, and registration: Azure Machine Learning workspace
Model deployment: Azure Machine Learning endpoints and Azure Kubernetes
Service
Model monitor: Azure Monitor for Application Insights
MLOps pipelines: Azure DevOps and Azure Pipelines

This example scenario also uses the following services to help protect the MLOps
solution:

Azure Key Vault


Azure Policy
Virtual Network

Scenario details
MLOps is a set of practices at the intersection of Machine Learning, DevOps, and data
engineering that aims to deploy and maintain machine learning models in production
reliably and efficiently.

The following diagram shows a simplified MLOps process model. This model offers a
solution that automates data preparation, model training, model evaluation, model
registration, model deployment, and monitoring.

When you implement an MLOps solution, you might want to help secure these
resources:

DevOps pipelines
Machine learning training data
Machine learning pipelines
Machine learning models

To help secure resources, consider these methods:


Authentication and authorization
Use service principals or managed identities instead of interactive authentication.
Use role-based access control to define the scope of a user's access to resources.

Network security
Use Virtual Network to partially or fully isolate the environment from the public internet to reduce the attack surface and the potential for data exfiltration.
If you're still using Azure Machine Learning CLI v1 and the Azure Machine Learning Python SDK v1 (the v1 API), add a private endpoint to the workspace to provide network isolation for everything except CRUD operations on the workspace or compute resources.
To take advantage of the new features of an Azure Machine Learning workspace, use Azure Machine Learning CLI v2 and the Azure Machine Learning Python SDK v2 (the v2 API). With the v2 API, enabling a private endpoint on your workspace doesn't provide the same level of network isolation, but the virtual network still helps protect the training data and machine learning models. We recommend that you evaluate the v2 API before adopting it in your enterprise solutions. For more information, see What is the new API platform on Azure Resource Manager.

Data encryption
Encrypt training data in transit and at rest by using platform-managed or
customer-managed access keys.

Policy and monitoring
Use Azure Policy and Microsoft Defender for Cloud to enforce policies, as sketched in the snippet after this list.
Use Azure Monitor to collect and aggregate data (such as metrics and logs) from various sources into a common data platform for analysis, visualization, and alerting.
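As an illustration of that last point, the following minimal Terraform sketch assigns a built-in Azure Policy definition at subscription scope. The policy display name shown here is an assumption (a commonly used built-in definition related to Azure Machine Learning and Private Link); verify the exact name and adapt the scope to your environment before use.

Terraform

# Minimal sketch, not part of the original scenario: assign a built-in policy
# definition at subscription scope. The display name below is an assumption.
data "azurerm_subscription" "current" {}

data "azurerm_policy_definition" "aml_private_link" {
  display_name = "Azure Machine Learning workspaces should use private link"
}

resource "azurerm_subscription_policy_assignment" "aml_private_link" {
  name                 = "aml-workspaces-use-private-link"
  subscription_id      = data.azurerm_subscription.current.id
  policy_definition_id = data.azurerm_policy_definition.aml_private_link.id
}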

The Azure Machine Learning workspace is the top-level resource for Azure Machine
Learning and the core component of an MLOps solution. The workspace provides a
centralized place to work with all the artifacts that you create when you use Azure
Machine Learning.

When you create a new workspace, it automatically creates the following Azure
resources that are used by the workspace:

Azure Application Insights
Azure Container Registry
Azure Key Vault
Azure Storage Account

Potential use cases


This solution fits scenarios in which a customer uses an MLOps solution to deploy and
maintain machine learning models in a more secure environment. Customers can come
from various industries, such as manufacturing, telecommunications, retail, healthcare,
and so on. For example:

A telecommunications carrier helps protect a customer's pictures, data, and machine learning models in its video monitoring system for retail stores.

An engine manufacturer needs a more secure solution to help protect the data and
machine learning models of its factories and products for its system that uses
computer vision to detect defects in parts.

The MLOps solutions for these scenarios and others might use Azure Machine Learning
workspaces, Azure Blob Storage, Azure Kubernetes Service, Container Registry, and
other Azure services.

You can use all or part of this example for any similar scenario that has an MLOps
environment that's deployed on Azure and uses Azure security capabilities to help
protect the relevant resources. The original customer for this solution is in the
telecommunications industry.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that improve the quality of a workload when applied. For
more information, see Microsoft Azure Well-Architected Framework.

Security
Security provides more assurances against deliberate attacks and the abuse of your
valuable data and systems. For more information, see Overview of the security pillar.

Consider how to help secure your MLOps solution beginning with the architecture
design. Development environments might not need significant security, but it's
important in the staging and production environments.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

Configuring Virtual Network is free of charge, but there are charges for the other
services that your scenario might require, such as private links, DNS zones, and virtual
network peering. The following table describes the charges for those services and others
that might be required.

Azure Service: Pricing
Virtual Network: Free of charge.
Private Link: Pay only for private endpoint resource hours and the data that is processed through your private endpoint.
Azure DNS, private zone: Billing is based on the number of DNS zones that are hosted in Azure and the number of DNS queries that are received.
Virtual Network peering: Inbound and outbound traffic is charged at both ends of the peered networks.
VPN gateway: Charges are based on the amount of time that the gateway is provisioned and available.
ExpressRoute: Charges are for ExpressRoute and ExpressRoute gateways.
Azure Bastion: Billing involves a combination of hourly pricing that is based on SKU, scale units, and data transfer rates.

Operational excellence
Operational excellence covers the operations processes that deploy an application and
keep it running in production. For more information, see Overview of the operational
excellence pillar.

To streamline continuous integration and continuous delivery (CI/CD), the best practice
is to use tools and services for infrastructure as code (IaC), such as Terraform or Azure
Resource Manager templates, Azure DevOps, and Azure Pipelines.
Deploy this scenario
The following sections describe how to deploy, access, and help secure resources in this
example scenario.

Virtual Network
The first step in helping to secure the MLOps environment is to help protect the Azure
Machine Learning workspace and its associated resources. An effective method of
protection is to use Virtual Network. Virtual Network is the fundamental building block
for your private network in Azure. Virtual Network lets many types of Azure resources
more securely communicate with each other, the internet, and on-premises networks.

Putting the Azure Machine Learning workspace and its associated resources into a
virtual network helps ensure that components can communicate with each other
without exposing them to the public internet. Doing so reduces their attack surface and
helps to prevent data exfiltration.
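The Terraform snippets in this article refer to a virtual network and subnet (aml_vnet, compute_subnet) whose definitions aren't shown. The following minimal sketch shows one way they might be declared; the resource names and address ranges are assumptions chosen to match the architecture diagram.

Terraform

# Minimal sketch: the virtual network and compute subnet that later snippets
# reference. Names and address ranges are assumptions, not part of the original.
resource "azurerm_virtual_network" "aml_vnet" {
  name                = "aml-vnet"
  address_space       = ["10.1.0.0/16"]
  location            = "eastasia"
  resource_group_name = "my_resource_group"
}

resource "azurerm_subnet" "compute_subnet" {
  name                 = "compute-subnet"
  resource_group_name  = "my_resource_group"
  virtual_network_name = azurerm_virtual_network.aml_vnet.name
  address_prefixes     = ["10.1.1.0/24"]
}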

The following Terraform snippet shows how to create a compute cluster for Azure
Machine Learning, attach it to a workspace, and put it into a subnet of a virtual network.

Terraform

resource "azurerm_machine_learning_compute_cluster" "compute_cluster" {


name = "my_compute_cluster"
location = "eastasia"
vm_priority = "LowPriority"
vm_size = "Standard_NC6s_v3"
machine_learning_workspace_id =
azurerm_machine_learning_workspace.my_workspace.id
subnet_resource_id = azurerm_subnet.compute_subnet.id
ssh_public_access_enabled = false
scale_settings {
min_node_count = 0
max_node_count = 3
scale_down_nodes_after_idle_duration = "PT30S"
}
identity {
type = "SystemAssigned"
}
}

Private Link and Azure Private Endpoint


Private Link enables access over a private endpoint in your virtual network to Azure
platform as a service (PaaS) options, such as an Azure Machine Learning workspace and
Azure Storage, and to Azure-hosted customer-owned and partner-owned services. A
private endpoint is a network interface that connects only to specific resources, thereby
helping to protect against data exfiltration.

In this example scenario, there are four private endpoints that are tied to Azure PaaS
options and are managed by a subnet in AML VNET, as shown in the architecture
diagram. Therefore, these services are only accessible to the resources within the same
virtual network, AML VNET. Those services are:

Azure Machine Learning workspace
Azure Blob Storage
Azure Container Registry
Azure Key Vault

The following Terraform snippet shows how to use a private endpoint to link to an Azure Machine Learning workspace, which the virtual network then helps protect. The snippet also shows use of a private DNS zone, which is described in Private DNS zones.

Terraform

resource "azurerm_machine_learning_workspace" "aml_ws" {


name = "my_aml_workspace"
friendly_name = "my_aml_workspace"
location = "eastasia"
resource_group_name = "my_resource_group"
application_insights_id = azurerm_application_insights.my_ai.id
key_vault_id = azurerm_key_vault.my_kv.id
storage_account_id = azurerm_storage_account.my_sa.id
container_registry_id = azurerm_container_registry.my_acr_aml.id

identity {
type = "SystemAssigned"
}
}

# Configure private DNS zones

resource "azurerm_private_dns_zone" "ws_zone_api" {


name = "privatelink.api.azureml.ms"
resource_group_name = var.RESOURCE_GROUP
}

resource "azurerm_private_dns_zone" "ws_zone_notebooks" {


name = "privatelink.notebooks.azure.net"
resource_group_name = var.RESOURCE_GROUP
}
# Link DNS zones to the virtual network

resource "azurerm_private_dns_zone_virtual_network_link" "ws_zone_api_link"


{
name = "ws_zone_link_api"
resource_group_name = "my_resource_group"
private_dns_zone_name = azurerm_private_dns_zone.ws_zone_api.name
virtual_network_id = azurerm_virtual_network.aml_vnet.id
}

resource "azurerm_private_dns_zone_virtual_network_link"
"ws_zone_notebooks_link" {
name = "ws_zone_link_notebooks"
resource_group_name = "my_resource_group"
private_dns_zone_name = azurerm_private_dns_zone.ws_zone_notebooks.name
virtual_network_id = azurerm_virtual_network.aml_vnet.id
}

# Configure private endpoints

resource "azurerm_private_endpoint" "ws_pe" {


name = "my_aml_ws_pe"
location = "eastasia"
resource_group_name = "my_resource_group"
subnet_id = azurerm_subnet.my_subnet.id

private_service_connection {
name = "my_aml_ws_psc"
private_connection_resource_id =
azurerm_machine_learning_workspace.aml_ws.id
subresource_names = ["amlworkspace"]
is_manual_connection = false
}

private_dns_zone_group {
name = "private-dns-zone-group-ws"
private_dns_zone_ids = [azurerm_private_dns_zone.ws_zone_api.id,
azurerm_private_dns_zone.ws_zone_notebooks.id]
}

# Add the private link after configuring the workspace


depends_on = [azurerm_machine_learning_compute_instance.compute_instance,
azurerm_machine_learning_compute_cluster.compute_cluster]
}

The preceding code for azurerm_machine_learning_workspace uses the v2 API platform by default. If you still want to use the v1 API, or if you have a company policy that prohibits sending communication over public networks, you can enable the v1_legacy_mode parameter, as shown in the following code snippet. When enabled, this parameter disables the v2 API for your workspace.
Terraform

resource "azurerm_machine_learning_workspace" "aml_ws" {


...
public_network_access_enabled = false
v1_legacy_mode_enabled = true
}

Private DNS zones


Azure DNS provides a reliable, more secure DNS service to manage and resolve domain
names in a virtual network without the need to add a custom DNS solution. By using
private DNS zones, you can use custom domain names rather than the names provided
by Azure. DNS resolution against a private DNS zone works only from virtual networks
that are linked to it.

This sample solution uses private endpoints for the Azure Machine Learning workspace
and for its associated resources such as Azure Storage, Azure Key Vault, or Container
Registry. Therefore, you must configure your DNS settings to resolve the IP addresses of
the private endpoints from the fully qualified domain name (FQDN) of the connection
string.

You can link a private DNS zone to a virtual network to resolve specific domains.

The Terraform snippet in Private Link and Azure Private Endpoint creates two private
DNS zones by using the zone names that are recommended in Azure services DNS zone
configuration:

privatelink.api.azureml.ms
privatelink.notebooks.azure.net

Virtual Network peering


Virtual network peering enables the jump-host virtual machine (VM) and the self-hosted agent VMs in BASTION VNET to access the resources in AML VNET. For connectivity purposes, the two virtual networks work as one. The traffic between VMs and Azure Machine Learning resources in peered virtual networks uses the Azure backbone infrastructure and is routed through Azure's private network.

The following Terraform snippet sets up virtual network peering between AML VNET and
BASTION VNET.
Terraform

# Virtual network peering for AML VNET and BASTION VNET
resource "azurerm_virtual_network_peering" "vp_amlvnet_basvnet" {
  name                         = "vp_amlvnet_basvnet"
  resource_group_name          = "my_resource_group"
  virtual_network_name         = azurerm_virtual_network.amlvnet.name
  remote_virtual_network_id    = azurerm_virtual_network.basvnet.id
  allow_virtual_network_access = true
  allow_forwarded_traffic      = true
}

resource "azurerm_virtual_network_peering" "vp_basvnet_amlvnet" {
  name                         = "vp_basvnet_amlvnet"
  resource_group_name          = "my_resource_group"
  virtual_network_name         = azurerm_virtual_network.basvnet.name
  remote_virtual_network_id    = azurerm_virtual_network.amlvnet.id
  allow_virtual_network_access = true
  allow_forwarded_traffic      = true
}

Access the resources in the virtual network


To access the Azure Machine Learning workspace in a virtual network, like AML VNET in
this scenario, use one of the following methods:

Azure VPN gateway
Azure ExpressRoute
Azure Bastion and the jump host VM

For more information, see Connect to the workspace.
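As an illustration of the Azure Bastion option, the following minimal Terraform sketch deploys a bastion host into BASTION VNET. The resource names and subnet address prefix are assumptions; the bastion subnet must be named AzureBastionSubnet.

Terraform

# Minimal sketch of the Azure Bastion option. Names and the address prefix are
# assumptions; the basvnet virtual network is defined in the peering snippet.
resource "azurerm_subnet" "bastion_subnet" {
  name                 = "AzureBastionSubnet"
  resource_group_name  = "my_resource_group"
  virtual_network_name = azurerm_virtual_network.basvnet.name
  address_prefixes     = ["10.2.1.0/26"]
}

resource "azurerm_public_ip" "bastion_ip" {
  name                = "bastion-pip"
  location            = "eastasia"
  resource_group_name = "my_resource_group"
  allocation_method   = "Static"
  sku                 = "Standard"
}

resource "azurerm_bastion_host" "bastion" {
  name                = "my_bastion"
  location            = "eastasia"
  resource_group_name = "my_resource_group"

  ip_configuration {
    name                 = "bastion-ipconfig"
    subnet_id            = azurerm_subnet.bastion_subnet.id
    public_ip_address_id = azurerm_public_ip.bastion_ip.id
  }
}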

Run Azure Pipelines that access the resources in the virtual network
Azure Pipelines automatically builds and tests code projects to make them available to
others. Azure Pipelines combines CI/CD to test and build your code and ship it to any
target.

Azure-hosted agents vs. self-hosted agents


The MLOps solution in this example scenario consists of two pipelines, which can trigger Azure Machine Learning pipelines and access associated resources. Because the Azure Machine Learning workspace and its associated resources are in a virtual network, this scenario must provide a way for an Azure Pipelines agent to access them. An agent is computing infrastructure with installed agent software that runs Azure Pipelines jobs one at a time. There are multiple ways to implement access:

Use self-hosted agents in the same virtual network or the peering virtual network,
as shown in the architecture diagram.

Use Azure-hosted agents and add their IP address ranges to an allowlist in the
firewall settings of the targeted Azure services.

Use Azure-hosted agents (as VPN clients) and VPN Gateway.

Each of these choices has pros and cons. The following table compares Azure-hosted
agents with self-hosted agents.

Cost
Azure-hosted agent: Start free for one parallel job with 1,800 minutes per month and a charge for each Azure-hosted CI/CD parallel job.
Self-hosted agent: Start free for one parallel job with unlimited minutes per month and a charge for each extra self-hosted CI/CD parallel job with unlimited minutes. This option offers less-expensive parallel jobs.

Maintenance
Azure-hosted agent: Taken care of for you by Microsoft.
Self-hosted agent: Maintained by you, with more control over installing the software you like.

Build time
Azure-hosted agent: More time consuming because it completely refreshes every time you start a build, and you always build from scratch.
Self-hosted agent: Saves time because it keeps all your files and caches.

Note

For current pricing, see Pricing for Azure DevOps .

Based on the comparisons in the table and the considerations of security and
complexity, this example scenario uses a self-hosted agent for Azure Pipelines to trigger
Azure Machine Learning pipelines in the virtual network.

To configure a self-hosted agent, you have the following options:

Install the agent on Azure Virtual Machines.


Install the agents on an Azure Virtual Machine Scale Set, which can be auto-scaled
to meet demand.

Install the agent in a Docker container. This option isn't feasible here, because this scenario might require running Docker containers within the agent for machine learning model training.

The following sample code provisions two self-hosted agents by creating Azure VMs
and extensions:

Terraform

resource "azurerm_linux_virtual_machine" "agent" {


...
}

resource "azurerm_virtual_machine_extension" "update-vm" {


count = 2
name = "update-vm${format("%02d", count.index)}"
publisher = "Microsoft.Azure.Extensions"
type = "CustomScript"
type_handler_version = "2.1"
virtual_machine_id = element(azurerm_linux_virtual_machine.agent.*.id,
count.index)

settings = <<SETTINGS
{
"script":
"${base64encode(templatefile("../scripts/terraform/agent_init.sh", {
AGENT_USERNAME = "${var.AGENT_USERNAME}",
ADO_PAT = "${var.ADO_PAT}",
ADO_ORG_SERVICE_URL = "${var.ADO_ORG_SERVICE_URL}",
AGENT_POOL = "${var.AGENT_POOL}"
}))}"
}
SETTINGS
}

As shown in the preceding code block, the Terraform script calls agent_init.sh, shown in
the following code block, to install agent software and required libraries on the agent
VM per the customer's requirements.

Bash

#!/bin/sh
# Install other required libraries
...

# Create a directory and download the Azure DevOps agent installation files
sudo mkdir /myagent
cd /myagent
sudo wget https://vstsagentpackage.azureedge.net/agent/2.194.0/vsts-agent-linux-x64-2.194.0.tar.gz
sudo tar zxvf ./vsts-agent-linux-x64-2.194.0.tar.gz
sudo chmod -R 777 /myagent

# Unattended installation
sudo runuser -l ${AGENT_USERNAME} -c '/myagent/config.sh --unattended --url ${ADO_ORG_SERVICE_URL} --auth pat --token ${ADO_PAT} --pool ${AGENT_POOL}'

cd /myagent
# Configure as a service
sudo ./svc.sh install ${AGENT_USERNAME}
# Start the service
sudo ./svc.sh start

Use Container Registry in the virtual network


There are some prerequisites for securing an Azure Machine Learning workspace in a
virtual network. For more information, see Prerequisites. Container Registry is a required
service when you use an Azure Machine Learning workspace to train and deploy the
models.

In this example scenario, to ensure the self-hosted agent can access the container
registry in the virtual network, we use virtual network peering and add a virtual network
link to link the private DNS zone, privatelink.azurecr.io, to BASTION VNET. The following
Terraform snippet shows the implementation.

Terraform

# Azure Machine Learning Container Registry is for private access
# by the Azure Machine Learning workspace
resource "azurerm_container_registry" "acr" {
  name                          = "my_acr"
  resource_group_name           = "my_resource_group"
  location                      = "eastasia"
  sku                           = "Premium"
  admin_enabled                 = true
  public_network_access_enabled = false
}

resource "azurerm_private_dns_zone" "acr_zone" {
  name                = "privatelink.azurecr.io"
  resource_group_name = "my_resource_group"
}

resource "azurerm_private_dns_zone_virtual_network_link" "acr_zone_link" {
  name                  = "link_acr"
  resource_group_name   = "my_resource_group"
  private_dns_zone_name = azurerm_private_dns_zone.acr_zone.name
  virtual_network_id    = azurerm_virtual_network.amlvnet.id
}

resource "azurerm_private_endpoint" "acr_ep" {
  name                = "acr_pe"
  resource_group_name = "my_resource_group"
  location            = "eastasia"
  subnet_id           = azurerm_subnet.aml_subnet.id

  private_service_connection {
    name                           = "acr_psc"
    private_connection_resource_id = azurerm_container_registry.acr.id
    subresource_names              = ["registry"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "private-dns-zone-group-app-acr"
    private_dns_zone_ids = [azurerm_private_dns_zone.acr_zone.id]
  }
}

This example scenario also grants the Contributor role on the container registry to the system-assigned managed identity of the Azure Machine Learning workspace, as sketched in the following snippet.
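The following minimal Terraform sketch shows one way to make that role assignment, assuming the acr and aml_ws resources defined earlier in this article.

Terraform

# Minimal sketch: grant the workspace's system-assigned identity the Contributor
# role on the container registry (assumes the acr and aml_ws resources above).
resource "azurerm_role_assignment" "acr_contributor" {
  scope                = azurerm_container_registry.acr.id
  role_definition_name = "Contributor"
  principal_id         = azurerm_machine_learning_workspace.aml_ws.identity[0].principal_id
}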

Use a compute cluster or instance in the virtual network


An Azure Machine Learning compute cluster or instance in a virtual network requires a
network security group (NSG) with some specific rules for its subnet. For a list of those
rules, see Limitations.
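As a rough illustration only, the following Terraform sketch shows the general shape of such an NSG and its subnet association. The service tags and port ranges shown are assumptions based on commonly documented Azure Machine Learning requirements; verify them against the Limitations article before use.

Terraform

# Rough sketch of an NSG for the compute subnet. Service tags and ports below
# are assumptions; confirm the required rules in the Limitations article.
resource "azurerm_network_security_group" "compute_nsg" {
  name                = "compute-subnet-nsg"
  location            = "eastasia"
  resource_group_name = "my_resource_group"

  security_rule {
    name                       = "AllowAzureMLInbound"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "44224"
    source_address_prefix      = "AzureMachineLearning"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "AllowBatchNodeManagementInbound"
    priority                   = 110
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "29876-29877"
    source_address_prefix      = "BatchNodeManagement"
    destination_address_prefix = "*"
  }
}

resource "azurerm_subnet_network_security_group_association" "compute_nsg_assoc" {
  subnet_id                 = azurerm_subnet.compute_subnet.id
  network_security_group_id = azurerm_network_security_group.compute_nsg.id
}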

Also, for a compute cluster or instance, it's now possible to remove the public IP address, which helps provide better protection for compute resources in the MLOps solution. For more information, see No public IP for compute instances.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Gary Wang | Principal Software Engineer

Other contributors:

Gary Moore | Programmer/Writer


To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Terraform on Azure documentation
Azure Machine Learning Enterprise Terraform Example
Azure MLOps (v2) solution accelerator
Azure Virtual Network pricing
Pricing for Azure DevOps

Related resources
Machine learning operations (MLOps) framework to upscale machine learning
lifecycle with Azure Machine Learning
Secure an Azure Machine Learning workspace with virtual networks
Azure Pipelines agents
Machine Learning operations
maturity model
Azure Machine Learning

The purpose of this maturity model is to help clarify the Machine Learning Operations
(MLOps) principles and practices. The maturity model shows the continuous
improvement in the creation and operation of a production level machine learning
application environment. You can use it as a metric for establishing the progressive
requirements needed to measure the maturity of a machine learning production
environment and its associated processes.

Maturity model
The MLOps maturity model helps clarify the Development Operations (DevOps)
principles and practices necessary to run a successful MLOps environment. It's intended
to identify gaps in an existing organization's attempt to implement such an
environment. It's also a way to show you how to grow your MLOps capability in
increments rather than overwhelm you with the requirements of a fully mature
environment. Use it as a guide to:

Estimate the scope of the work for new engagements.

Establish realistic success criteria.

Identify deliverables you'll hand over at the conclusion of the engagement.

As with most maturity models, the MLOps maturity model qualitatively assesses
people/culture, processes/structures, and objects/technology. As the maturity level
increases, the probability increases that incidents or errors will lead to improvements in
the quality of the development and production processes.

The MLOps maturity model encompasses five levels of technical capability:

Level 0: No MLOps
Highlights: Difficult to manage the full machine learning model lifecycle. The teams are disparate and releases are painful. Most systems exist as "black boxes," with little feedback during/post deployment.
Technology: Manual builds and deployments. Manual testing of model and application. No centralized tracking of model performance. Training of model is manual.

Level 1: DevOps but no MLOps
Highlights: Releases are less painful than No MLOps, but rely on the Data Team for every new model. Still limited feedback on how well a model performs in production. Difficult to trace/reproduce results.
Technology: Automated builds. Automated tests for application code.

Level 2: Automated Training
Highlights: Training environment is fully managed and traceable. Easy to reproduce model. Releases are manual, but low friction.
Technology: Automated model training. Centralized tracking of model training performance. Model management.

Level 3: Automated Model Deployment
Highlights: Releases are low friction and automatic. Full traceability from deployment back to original data. Entire environment managed: train > test > production.
Technology: Integrated A/B testing of model performance for deployment. Automated tests for all code. Centralized tracking of model training performance.

Level 4: Full MLOps Automated Operations
Highlights: Full system automated and easily monitored. Production systems provide information on how to improve and, in some cases, automatically improve with new models. Approaching a zero-downtime system.
Technology: Automated model training and testing. Verbose, centralized metrics from deployed model.
The tables that follow identify the detailed characteristics for that level of process
maturity. The model will continue to evolve. This version was last updated in January
2020.

Level 0: No MLOps
People:
Data scientists: siloed, not in regular communications with the larger team
Data engineers (if exists): siloed, not in regular communications with the larger team
Software engineers: siloed, receive model remotely from the other team members

Model Creation:
Data gathered manually
Compute is likely not managed
Experiments aren't predictably tracked
End result may be a single model file manually handed off with inputs/outputs

Model Release:
Manual process
Scoring script may be manually created well after experiments, not version controlled
Release handled by data scientist or data engineer alone

Application Integration:
Heavily reliant on data scientist expertise to implement
Manual releases each time

Level 1: DevOps no MLOps


People:
Data scientists: siloed, not in regular communications with the larger team
Data engineers (if exists): siloed, not in regular communication with the larger team
Software engineers: siloed, receive model remotely from the other team members

Model Creation:
Data pipeline gathers data automatically
Compute is or isn't managed
Experiments aren't predictably tracked
End result may be a single model file manually handed off with inputs/outputs

Model Release:
Manual process
Scoring script may be manually created well after experiments, likely version controlled
Is handed off to software engineers

Application Integration:
Basic integration tests exist for the model
Heavily reliant on data scientist expertise to implement model
Releases automated
Application code has unit tests

Level 2: Automated Training


People:
Data scientists: working directly with data engineers to convert experimentation code into repeatable scripts/jobs
Data engineers: working with data scientists
Software engineers: siloed, receive model remotely from the other team members

Model Creation:
Data pipeline gathers data automatically
Compute managed
Experiment results tracked
Both training code and resulting models are version controlled

Model Release:
Manual release
Scoring script is version controlled with tests
Release managed by software engineering team

Application Integration:
Basic integration tests exist for the model
Heavily reliant on data scientist expertise to implement model
Application code has unit tests

Level 3: Automated Model Deployment


People:
Data scientists: working directly with data engineers to convert experimentation code into repeatable scripts/jobs
Data engineers: working with data scientists and software engineers to manage inputs/outputs
Software engineers: working with data engineers to automate model integration into application code

Model Creation:
Data pipeline gathers data automatically
Compute managed
Experiment results tracked
Both training code and resulting models are version controlled

Model Release:
Automatic release
Scoring script is version controlled with tests
Release managed by continuous delivery (CI/CD) pipeline

Application Integration:
Unit and integration tests for each model release
Less reliant on data scientist expertise to implement model
Application code has unit/integration tests

Level 4: Full MLOps Automated Retraining


People:
Data scientists: working directly with data engineers to convert experimentation code into repeatable scripts/jobs, and with software engineers to identify markers for data engineers
Data engineers: working with data scientists and software engineers to manage inputs/outputs
Software engineers: working with data engineers to automate model integration into application code, and implementing post-deployment metrics gathering

Model Creation:
Data pipeline gathers data automatically
Retraining triggered automatically based on production metrics
Compute managed
Experiment results tracked
Both training code and resulting models are version controlled

Model Release:
Automatic release
Scoring script is version controlled with tests
Release managed by continuous integration and CI/CD pipeline

Application Integration:
Unit and integration tests for each model release
Less reliant on data scientist expertise to implement model
Application code has unit/integration tests

Next steps
Learning path: Introduction to machine learning operations (MLOps)
Training module: Start the machine learning lifecycle with MLOps
MLOps: Model management, deployment, and monitoring with Azure Machine
Learning
What are Azure Machine Learning pipelines?
Related resources
Machine learning operations (MLOps) framework to upscale machine learning
lifecycle with Azure Machine Learning
Orchestrate MLOps by using Azure Databricks
Secure MLOps solutions with Azure network security
MLOps for Python models using Azure Machine Learning
Machine learning operations
(MLOps) framework to upscale
machine learning lifecycle with
Azure Machine Learning
Azure Data Factory Azure Machine Learning

This client project helped a Fortune 500 food company improve its demand forecasting. The
company ships products directly to multiple retail outlets. The improvement helped them
optimize the stocking of their products in different stores across several regions of the United
States. To achieve this, Microsoft's Commercial Software Engineering (CSE) team worked with
the client's data scientists on a pilot study to develop customized machine learning models for
the selected regions. The models take into account:

Shopper demographics
Historical and forecasted weather
Past shipments
Product returns
Special events

The goal to optimize stocking represented a major component of the project and the client
realized a significant sales lift in the early field trials. Also, the team saw a 40% reduction in
forecasting mean absolute percentage error (MAPE) when compared with a historical average
baseline model.

A key part of the project was figuring out how to scale up the data science workflow from the
pilot study to a production level. This production-level workflow required the CSE team to:

Develop models for many regions.
Continuously update and monitor performance of the models.
Facilitate collaboration between the data and engineering teams.

The typical data science workflow today is closer to a one-off lab environment than a
production workflow. An environment for data scientists must be suitable for them to:

Prepare the data.
Experiment with different models.
Tune hyperparameters.
Create a build-test-evaluate-refine cycle.

Most tools that are used for these tasks have specific purposes and aren't well suited to
automation. In a production level machine learning operation, there must be more
consideration given to application lifecycle management and DevOps.

The CSE team helped the client scale up the operation to production levels. They implemented various aspects of continuous integration (CI)/continuous delivery (CD) capabilities and addressed issues like observability and integration with Azure capabilities. During the implementation, the team uncovered gaps in existing MLOps guidance. Those gaps needed to be filled so that MLOps was better understood and applied at scale.

Understanding MLOps practices helps organizations ensure that the machine learning models that the system produces are production quality models that improve business performance. When MLOps is implemented, the organization no longer has to spend as much time on low-level details relating to the infrastructure and engineering work that's required to develop and run machine learning models for production level operations. Implementing MLOps also helps the data science and software engineering communities learn to work together to deliver a production-ready system.

The CSE team used this project to address machine learning community needs by addressing
issues like developing an MLOps maturity model. These efforts were aimed at improving
MLOps adoption by understanding the typical challenges of the key players in the MLOps
process.

Engagement and technical scenarios


The engagement scenario discusses the real-world challenges that the CSE team had to solve.
The technical scenario defines the requirements to create an MLOps lifecycle that's as reliable
as the well established DevOps lifecycle.

Engagement scenario
The client delivers products directly to retail market outlets on a regular schedule. Each retail
outlet varies in its product usage patterns, so product inventory needs to vary in each weekly
delivery. Maximizing sales and minimizing product returns and lost sales opportunities are the
goals of the demand forecasting methodologies that the client uses. This project focused on
using machine learning to improve the forecasts.

The CSE team divided the project into two phases. Phase 1 focused on developing machine
learning models to support a field-based pilot study on the effectiveness of machine learning
forecasting for a selected sales region. The success of Phase 1 led to Phase 2, in which the
team scaled up the initial pilot study from a minimal group of models that supported a single
geographic region to a set of sustainable production-level models for all of the client's sales
regions. A primary consideration for the scaled up solution was the need to accommodate the
large number of geographic regions and their local retail outlets. The team dedicated the
machine learning models to both large and small retail outlets in each region.
The Phase 1 pilot study determined that a model dedicated to one region's retail outlets could
use local sales history, local demographics, weather, and special events to optimize the
demand forecast for the outlets in the region. Four ensemble machine learning forecasting
models served market outlets in a single region. The models processed data in weekly batches.
Also, the team developed two baseline models using historical data for comparison.

For the first version of the scaled up Phase 2 solution, the CSE team selected 14 geographic
regions to participate, including small and large market outlets. They used more than 50
machine learning forecasting models. The team expected further system growth and continued
refinement of the machine learning models. It quickly became clear that this wider-scaled
machine learning solution is sustainable only if it's based on the best practice principles of
DevOps for the machine learning environment.

Environment: Dev environment
Market Region: Each geographic market/region (for example, North Texas)

Large format stores (supermarkets, big box stores, and so on): two ensemble models
Slow moving products: slow and fast both have an ensemble of a least absolute shrinkage and selection operator (LASSO) linear regression model and a neural network with categorical embeddings
Fast moving products: slow and fast both have an ensemble of a LASSO linear regression model and a neural network with categorical embeddings
One ensemble model: N/A; historical average model

Small format stores (pharmacies, convenience stores, and so on): two ensemble models
Slow moving products: slow and fast both have an ensemble of a LASSO linear regression model and a neural network with categorical embeddings
Fast moving products: slow and fast both have an ensemble of a LASSO linear regression model and a neural network with categorical embeddings
One ensemble model: N/A; historical average model

Same as above for an additional 13 geographic regions.

Same as above for the prod environment.

The MLOps process provided a framework for the scaled up system that addressed the full
lifecycle of the machine learning models. The framework includes development, testing,
deployment, operation, and monitoring. It fulfills the needs of a classic CI/CD process.
However, because of its relative immaturity compared to DevOps, it became evident that
existing MLOps guidance had gaps. The project team worked to fill in some of those gaps.
They wanted to provide a functional process model that ensures the viability of the scaled up
machine learning solution.

The MLOps process that was developed from this project made a significant real-world step to
move MLOps to a higher level of maturity and viability. The new process is directly applicable
to other machine learning projects. The CSE team used what they learned to build a draft of an
MLOps maturity model that anyone can apply to other machine learning projects.

Technical scenario
MLOps, also known as DevOps for machine learning, is an umbrella term that encompasses
philosophies, practices, and technologies that are related to implementing machine learning
lifecycles in a production environment. It's still a relatively new concept. There have been many
attempts to define what MLOps is and many people have questioned whether MLOps can
subsume everything from how data scientists prepare data to how they ultimately deliver,
monitor, and evaluate machine learning results. While DevOps has had years to develop a set
of fundamental practices, MLOps is still early in its development. As it evolves, we discover the
challenges of bringing together two disciplines that often operate with different skill sets and
priorities: software/ops engineering, and data science.

Implementing MLOps in real-world production environments has unique challenges that must
be overcome. Teams can use Azure to support MLOps patterns. Azure can also provide clients
with asset management and orchestration services for effectively managing the machine
learning lifecycle. Azure services are the foundation for the MLOps solution that we describe in
this article.

Machine learning model requirements


Much of the work during the Phase 1 pilot field study was creating the machine learning
models that the CSE team applied to the large and small retail stores in a single region.
Notable requirements for the models included:

Use of the Azure Machine Learning service.

Initial experimental models that were developed in Jupyter notebooks and implemented
in Python.

Note

Teams used the same machine learning approach for large and small stores, but the
training and scoring data depended on the size of the store.

Data that requires preparation for model consumption.

Data that's processed on a batch basis rather than in real time.

Model retraining whenever code or data changes, or the model goes stale.

Viewing of model performance in Power BI dashboards.

Model scoring performance that's considered significant when the MAPE is <= 45% compared with a historical average baseline model.

MLOps requirements
The team had to meet several key requirements to scale up the solution from the Phase 1 pilot
field study, in which only a few models were developed for a single sales region. Phase 2
implemented custom machine learning models for multiple regions. The implementation
included:

Weekly batch processing for large and small stores in each region to retrain the models
with new datasets.

Continuous refinement of the machine learning models.

Integration of the development/test/package/test/deploy process common to CI/CD in a DevOps-like processing environment for MLOps.

Note

This represents a shift in how data scientists and data engineers have commonly worked in the past.

A unique model that represented each region for large and small stores based on store
history, demographics, and other key variables. The model had to process the entire
dataset to minimize the risk of processing error.
The ability to initially scale up to support 14 sales regions with plans to scale up further.

Plans for additional models for longer term forecasting for regions and other store
clusters.

Machine learning model solution


The machine learning lifecycle, also known as the data science lifecycle, fits roughly into the
following high-level process flow:

Deploy Model here can represent any operational use of the validated machine learning model.
Compared to DevOps, MLOps presents the additional challenge of integrating the machine
learning lifecycle into the typical CI/CD process.

The data science lifecycle doesn't follow the typical software development lifecycle. It includes
the use of Azure Machine Learning to train and score the models, so these steps had to be
included in the CI/CD automation.

Batch processing of data is the basis of the architecture. Two Azure Machine Learning pipelines
are central to the process, one for training and the other for scoring. This diagram shows the
data science methodology that was used for the initial phase of the client project:

The team tested several algorithms. They ultimately chose an ensemble design of a LASSO
linear regression model and a neural network with categorical embeddings. The team used the
same model, defined by the level of product that the client could store on site, for both large
and small stores. The team further subdivided the model into fast-moving and slow-moving
products.

The data scientists train the machine learning models when the team releases new code and
when new data is available. Training typically happens weekly. Consequently, each processing
run involves a large amount of data. Because the team collects the data from many sources in
different formats, it requires conditioning to put the data into a consumable format before the
data scientists can process it. The data conditioning requires significant manual effort and the
CSE team identified it as a primary candidate for automation.

As mentioned, the data scientists developed and applied the experimental Azure Machine
Learning models to a single sales region in the Phase 1 pilot field study to evaluate the
usefulness of this forecasting approach. The CSE team judged that the sales lift for the stores in
the pilot study was significant. This success justified applying the solution to full production
levels in Phase 2, starting with 14 geographic regions and thousands of stores. The team could
then use the same pattern to add additional regions.

The pilot model served as the basis for the scaled up solution, but the CSE team knew that the
model needed further refinement on a continuing basis to improve its performance.

MLOps solution
As MLOps concepts mature, teams often discover challenges in bringing the data science and
DevOps disciplines together. The reason is that the principal players in the disciplines, software
engineers and data scientists, operate with different skill sets and priorities.

But there are similarities to build on. MLOps, like DevOps, is a development process
implemented by a toolchain. The MLOps toolchain includes such things as:

Version control
Code analysis
Build automation
Continuous integration
Testing frameworks and automation
Compliance policies integrated into CI/CD pipelines
Deployment automation
Monitoring
Disaster recovery and high availability
Package and container management

As noted above, the solution takes advantage of existing DevOps guidance, but is augmented
to create a more mature MLOps implementation that meets the needs of the client and of the
data science community. MLOps builds on DevOps guidance with these additional
requirements:
Data and model versioning isn't the same as code versioning: There must be versioning
of datasets as the schema and origin data changes.
Digital audit trail requirements: Track all changes when dealing with code and client
data.
Generalization: Models are different from code for reuse, since data scientists must tune models based on input data and scenario. To reuse a model for a new scenario, you may need to fine-tune it or apply transfer learning, so you need the training pipeline.
Stale models: Models tend to decay over time and you need the ability to retrain them
on demand to ensure they remain relevant in production.

MLOps challenges

Immature MLOps standard


The standard pattern for MLOps is still evolving. A solution is typically built from scratch and
made to fit the needs of a particular client or user. The CSE team recognized this gap and
sought to use DevOps best practices in this project. They augmented the DevOps process to fit
the additional requirements of MLOps. The process the team developed is a viable example of
what an MLOps standard pattern should look like.

Differences in skill sets


Software engineers and data scientists bring unique skill sets to the team. These different skill
sets can make finding a solution that fits everyone's needs difficult. Building a well-understood
workflow for model delivery from experimentation to production is important. Team members
must share an understanding of how they can integrate changes into the system without
breaking the MLOps process.

Managing multiple models


There's often a need for multiple models to solve for difficult machine learning scenarios. One
of the challenges of MLOps is managing these models, including:

Having a coherent versioning scheme.


Continually evaluating and monitoring all the models.

Traceable lineage of both code and data is also needed to diagnose model issues and create
reproducible models. Custom dashboards can make sense of how deployed models are
performing and indicate when to intervene. The team created such dashboards for this project.

Need for data conditioning


Data used with these models comes from many private and public sources. Because the
original data is disorganized, it's impossible for the machine learning model to consume it in
its raw state. The data scientists must condition the data into a standard format for machine
learning model consumption.

Much of the pilot field test focused on conditioning the raw data so that the machine learning
model could process it. In an MLOps system, the team should automate this process, and track
the outputs.

MLOps maturity model


The purpose of the MLOps maturity model is to clarify the principles and practices and to
identify gaps in an MLOps implementation. It's also a way to show a client how to
incrementally grow their MLOps capability instead of trying to do it all at once. The client
should use it as a guide to:

Estimate the scope of the work for the project.


Establish success criteria.
Identify deliverables.

The MLOps maturity model defines five levels of technical capability:

Level Description

0 No Ops

1 DevOps but no MLOps

2 Automated training

3 Automated model deployment

4 Automated operations (full MLOps)

For the current version of the MLOps maturity model, see the MLOps maturity model article.

MLOps process definition


MLOps includes all activities from acquiring raw data to delivering model output, also known
as scoring:

Data conditioning
Model training
Model testing and evaluation
Build definition and pipeline
Release pipeline
Deployment
Scoring

Basic machine learning process


The basic machine learning process resembles traditional software development, but there are
significant differences. This diagram illustrates the major steps in the machine learning process:

The Experiment phase is unique to the data science lifecycle, which reflects how data scientists
traditionally do their work. It differs from how code developers do their work. The following
diagram illustrates this lifecycle in more detail.
Integrating this data development process into MLOps poses a challenge. Here you see the
pattern that the team used to integrate the process into a form that MLOps can support:

The role of MLOps is to create a coordinated process that can efficiently support the large-
scale CI/CD environments that are common in production level systems. Conceptually, the
MLOps model must include all process requirements from experimentation to scoring.

The CSE team refined the MLOps process to fit the client's specific needs. The most notable
need was batch processing instead of real-time processing. As the team developed the scaled
up system, they identified and resolved some shortcomings. The most significant of these
shortcomings led to the development of a bridge between Azure Data Factory and Azure
Machine Learning, which the team implemented by using a built-in connector in Azure Data
Factory. They created this component set to facilitate the triggering and status monitoring
necessary to make the process automation work.

Another fundamental change was that the data scientists needed the capability to export
experimental code from Jupyter notebooks into the MLOps deployment process rather than
trigger training and scoring directly.

Here is the final MLOps process model concept:

Important

Scoring is the final step. The process runs the machine learning model to make
predictions. This addresses the basic business use case requirement for demand
forecasting. The team rates the quality of the predictions using the MAPE, which is a
measure of prediction accuracy of statistical forecasting methods and a loss function for
regression problems in machine learning. In this project, the team considered a MAPE of
<= 45% significant.
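For reference, the standard textbook definition of MAPE over $n$ forecast periods, where $A_t$ is the actual value and $F_t$ is the forecast for period $t$, is (the team's exact variant isn't stated in this article):

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right|$$

Lower values indicate more accurate forecasts.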

MLOps process flow


The following diagram describes how to apply CI/CD development and release workflows to
the machine learning lifecycle:
When a pull request (PR) is created from a feature branch, the pipeline runs code
validation tests to validate the quality of the code via unit tests and code quality tests. To
validate quality upstream, the pipeline also runs basic model validation tests to validate
the end-to-end training and scoring steps with a sample set of mocked data.
When the PR is merged into the main branch, the CI pipeline will run the same code
validation tests and basic model validation tests with increased epoch. The pipeline will
then package the artifacts, which include the code and binaries, to run in the machine
learning environment.
After the artifacts are available, a model validation CD pipeline is triggered. It runs end-
to-end validation on the development machine learning environment. A scoring
mechanism is published. For a batch scoring scenario, a scoring pipeline is published to
the machine learning environment and triggered to produce results. If you want to use a
real-time scoring scenario, you can publish a web app or deploy a container.
Once a milestone is created and merged into the release branch, the same CI pipeline
and model validation CD pipeline are triggered. This time, they run against the code from
the release branch.

You can consider the MLOps process data flow shown above as an archetype framework for
projects that make similar architectural choices.

Code validation tests


Code validation tests for machine learning focus on validating the quality of the code base. It's
the same concept as any engineering project that has code quality tests (linting), unit tests,
and code coverage measurements.

Basic model validation tests


Model validation typically refers to validating the full end-to-end process steps required to
produce a valid machine learning model. It includes steps like:

Data validation: Ensures that the input data is valid.


Training validation: Ensures that the model can be successfully trained.
Scoring validation: Ensures that the team can successfully use the trained model for
scoring with the input data.

Running this full set of steps on the machine learning environment is expensive and time
consuming. As a result, the team did basic model validation tests locally on a development
machine. It ran the steps above and used the following:

Local testing dataset: A small dataset, often one that's obfuscated, that's checked in to
the repository and consumed as the input data source.
Local flag: A flag or argument in the model's code that indicates that the code intends
the dataset to run locally. The flag tells the code to bypass any call to the machine
learning environment.

The goal of these validation tests isn't to evaluate the performance of the trained model.
Rather, it's to validate that the code for the end-to-end process is of good quality. It assures
the quality of the code that's pushed upstream, like the incorporation of model validation tests
in the PR and CI build. It also makes it possible for engineers and data scientists to put
breakpoints into the code for debugging purposes.

Model validation CD pipeline


The goal of the model validation pipeline is to validate the end-to-end model training and
scoring steps on the machine learning environment with actual data. Any trained model that's
produced will be added to the model registry and tagged, to await promotion after validation
is completed. For batch prediction, promotion can be the publishing of a scoring pipeline that
uses this version of the model. For real-time scoring, the model can be tagged to indicate that
it has been promoted.

Scoring CD pipeline
The scoring CD pipeline is applicable for the batch inference scenario, where the same model
orchestrator that's used for model validation triggers the published scoring pipeline.

Development vs. production environments


It's a good practice to separate the development (dev) environment from the production
(prod) environment. Separation allows the system to trigger the model validation CD pipeline
and scoring CD pipeline on different schedules. For the described MLOps flow, pipelines
targeting the main branch run in the dev environment, and the pipeline that targets the release
branch runs in the prod environment.

Code changes vs. data changes


The previous sections deal mostly with how to handle code changes from development to
release. However, data changes should follow the same rigor as code changes to provide the
same validation quality and consistency in production. With a data change trigger or a timer
trigger, the system can trigger the model validation CD pipeline and the scoring CD pipeline
from the model orchestrator to run the same process that's run for code changes in the release
branch prod environment.

MLOps personas and roles


A key requirement for any MLOps process is that it meet the needs of the many users of the
process. For design purposes, consider these users as individual personas. For this project, the
team identified these personas:

Data scientist: Creates the machine learning model and its algorithms.
Engineer
Data engineer: Handles data conditioning.
Software engineer: Handles model integration into the asset package and the CI/CD
workflow.
Operations or IT: Oversees system operations.
Business stakeholder: Concerned with the predictions made by the machine learning
model and how they help the business.
Data end user: Consumes model output in some way that aids in making business
decisions.

The team had to address three key findings from the persona and role studies:

Data scientists and engineers have a mismatch of approach and skills in their work.
Making it easy for the data scientist and the engineer to collaborate is a major
consideration for the design of the MLOps process flow. It requires new skill acquisitions
by all team members.
There's a need to unify all of the principal personas without alienating anyone. A way to
do this is to:
Make sure they understand the conceptual model for MLOps.
Agree on the team members that will work together.
Establish working guidelines to achieve common goals.
If the business stakeholder and data end user need a way to interact with the data output
from the models, a user-friendly UI is the standard solution.
Other teams will certainly come across similar issues in other machine learning projects as they
scale up for production use.

MLOps solution architecture

Logical architecture

The data comes from many sources in many different formats, so it's conditioned before it's
inserted into the data lake. The conditioning is done by using microservices operating as Azure
Functions. The clients customize the microservices to fit the data sources and transform them
into a standardized CSV format that the training and scoring pipelines consume.

System architecture

Batch processing architecture


The team devised the architectural design to support a batch data processing scheme. There
are alternatives, but whatever is used must support MLOps processes. Full use of available
Azure services was a design requirement. The following diagram shows the architecture:

Solution overview
Azure Data Factory does the following:

Triggers an Azure Function to start data ingestion and a run of the Azure Machine
Learning pipeline.
Launches a durable function to poll the Azure Machine Learning pipeline for completion.
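As a minimal sketch of the polling step, a function could check an Azure Machine Learning
pipeline run for completion with the azureml-core SDK along the following lines. The
experiment name and the way the run ID is passed in are assumptions, not the project's code.

Python

# Hypothetical status check for an Azure Machine Learning pipeline run.
from azureml.core import Experiment, Workspace
from azureml.pipeline.core.run import PipelineRun


def get_pipeline_status(run_id: str) -> str:
    ws = Workspace.from_config()                        # reads config.json for the workspace
    experiment = Experiment(ws, name="batch-scoring")   # assumed experiment name
    run = PipelineRun(experiment, run_id)
    return run.get_status()                             # for example "Running", "Finished", or "Failed"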

Custom dashboards in Power BI display the results. Other Azure dashboards, connected to
Azure SQL Database, Azure Monitor, and Application Insights via the OpenCensus Python SDK,
track Azure resources. These dashboards provide information about the health of the machine
learning system. They also yield data that the client uses for product order forecasting.

Model orchestration
Model orchestration follows these steps:
1. When a PR is submitted, DevOps triggers a code validation pipeline.
2. The pipeline runs unit tests, code quality tests, and model validation tests.
3. When merged into the main branch, the same code validation tests are run, and DevOps
packages the artifacts.
4. The collection of artifacts by DevOps triggers Azure Machine Learning to perform:
a. Data validation.
b. Training validation.
c. Scoring validation.
5. After validation completes, the final scoring pipeline runs.
6. Changing data and submitting a new PR triggers the validation pipeline again, followed
by the final scoring pipeline.
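For illustration only, a release step might trigger the published scoring pipeline through the
SDK in roughly the following way; the pipeline ID and experiment name are placeholders, not
values from the project.

Python

# Hypothetical trigger of a published Azure Machine Learning pipeline from a CD step.
from azureml.core import Workspace
from azureml.pipeline.core import PublishedPipeline

ws = Workspace.from_config()
published = PublishedPipeline.get(workspace=ws, id="<published-pipeline-id>")  # placeholder ID

# Submit the published pipeline as a new run under an experiment.
pipeline_run = published.submit(ws, experiment_name="scoring-cd")  # assumed experiment name
pipeline_run.wait_for_completion(show_output=True)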

Enable experimentation
As mentioned, the traditional data science machine learning lifecycle doesn't support the
MLOps process without modification. It relies on manual tools and manual processes for
experimentation, validation, packaging, and model handoff that can't easily scale to an
effective CI/CD process. MLOps demands a high level of process automation. Whether a new
machine learning model is being developed or an old one is modified, it's necessary to
automate the lifecycle of the machine learning model. In the Phase 2 project, the team used
Azure DevOps to orchestrate and republish Azure Machine Learning pipelines for training
tasks. The long-running main branch performs basic testing of models, and pushes stable
releases through the long-running release branch.
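As a sketch of what publishing a training pipeline can look like with azureml-core (the
compute cluster name, script, and pipeline name below are assumptions, not the project's
actual configuration):

Python

# Hypothetical definition and publication of a training pipeline.
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
compute_target = ComputeTarget(workspace=ws, name="cpu-cluster")  # assumed cluster name

train_step = PythonScriptStep(
    name="train_model",
    script_name="train.py",          # assumed training script
    source_directory="src",
    compute_target=compute_target,
    allow_reuse=False,
)

pipeline = Pipeline(workspace=ws, steps=[train_step])
published = pipeline.publish(name="training-pipeline", description="Model training", version="1.0")
print("Published pipeline ID:", published.id)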

Source control becomes an important part of this process. Git is the version control system
that's used to track notebook and model code. It also supports process automation. The basic
workflow that's implemented for source control applies the following principles:

Use formal versioning for code and datasets.


Use a branch for new code development until the code is fully developed and validated.
After new code is validated, it can be merged into the main branch.
For a release, a permanent versioned branch is created that's separate from the main
branch.
Use versions and source control for the datasets that have been conditioned for training
or consumption, so that you can maintain the integrity of each dataset.
Use source control to track your Jupyter Notebook experiments.

Integration with data sources


Data scientists use many raw data sources and processed datasets to experiment with different
machine learning models. The volume of data in a production environment can be
overwhelming. For the data scientists to experiment with different models, they need to use
management tools like Azure Data Lake. The requirement for formal identification and version
control applies to all raw data, prepared datasets, and machine learning models.
In the project, the data scientists conditioned the following data for input into the model:

Historical weekly shipment data since January 2017


Historical and forecasted daily weather data for each zip code
Shopper data for each store ID

Integration with source control


To get data scientists to apply engineering best practices, it's necessary to conveniently
integrate the tools they use with source control systems like GitHub. This practice allows for
machine learning model versioning, collaboration between team members, and disaster
recovery should the teams experience a loss of data or a system outage.

Model ensemble support


The model design in this project was an ensemble model. That is, data scientists used many
algorithms in the final model design. In this case, the models used the same basic algorithm
design. The only difference was that they used different training data and scoring data. The
models used the combination of a LASSO linear regression algorithm and a neural network.
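A minimal, illustrative sketch of that kind of combination (not the project's code) could
average a LASSO regressor and a small neural network with scikit-learn; the file and column
names below are placeholders.

Python

# Hypothetical averaging ensemble of a LASSO model and a neural network.
import pandas as pd
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("conditioned_shipments.csv")                   # placeholder dataset
X, y = data.drop(columns=["units_shipped"]), data["units_shipped"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ensemble = VotingRegressor(estimators=[
    ("lasso", make_pipeline(StandardScaler(), Lasso(alpha=0.1))),
    ("mlp", make_pipeline(StandardScaler(),
                          MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500))),
])
ensemble.fit(X_train, y_train)
print("R^2 on held-out data:", ensemble.score(X_test, y_test))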

The team explored, but did not implement, an option to carry the process forward to the point
where it would support having many real-time models running in production to service a given
request. This option can accommodate the use of ensemble models in A/B testing and
interleaved experiments.

End-user interfaces
The team developed end-user UIs for observability, monitoring, and instrumentation. As
mentioned, dashboards visually display the machine learning model data. These dashboards
show the following data in a user-friendly format:

Pipeline steps, including pre-processing the input data.


To monitor the health of the machine learning model processing:
What metrics do you collect from your deployed model?
MAPE: Mean absolute percentage error, the key metric to track for overall
performance. (Target a MAPE value of <= 0.45 for each model; a metric computation
sketch follows this list.)
RMSE 0: Root-mean-square error (RMSE) when the actual target value = 0.
RMSE All: RMSE on the entire dataset.
How do you evaluate if your model is performing as expected in production?
Is there a way to tell if production data is deviating too much from expected values?
Is your model performing poorly in production?
Do you have a failover state?
Track the quality of the processed data.
Display the scoring/predictions produced by the machine learning model.
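As a quick reference, the sketch below computes the three metrics named above with NumPy.
The zero-handling in MAPE and the sample values are assumptions for illustration, not the
project's definitions.

Python

import numpy as np


def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute percentage error over nonzero actuals (assumed handling of zeros)."""
    mask = y_true != 0
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])))


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


y_true = np.array([12.0, 0.0, 8.0, 5.0])   # sample actuals
y_pred = np.array([10.0, 1.0, 9.0, 5.5])   # sample predictions

print("MAPE:", mape(y_true, y_pred))                              # overall performance (target <= 0.45)
print("RMSE 0:", rmse(y_true[y_true == 0], y_pred[y_true == 0]))  # error where the actual value is 0
print("RMSE All:", rmse(y_true, y_pred))                          # error over the entire data set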
The application populates the dashboards according to the nature of the data and how it
processes and analyzes the data. As such, the team must design the exact layout of the
dashboards for each use case. Here are two sample dashboards:
The dashboards were designed to provide readily usable information for consumption by the
end user of the machine learning model predictions.

Note

Stale models are scoring runs in which the model used for scoring was trained more than
60 days before scoring took place. The Scoring page of the ML Monitor dashboard displays
this health metric.

Components
Azure Machine Learning
Azure Machine Learning Compute
Azure Machine Learning Pipelines
Azure Machine Learning Model Registry
Azure Blob Storage
Azure Data Lake Storage
Azure Pipelines
Azure Data Factory
Azure Functions for Python
Azure Monitor
Logs
Application Insights
Azure SQL Database
Azure Dashboards
Power BI

Considerations
Here you'll find a list of considerations to explore. They're based on the lessons the CSE team
learned during the project.

Environment considerations
Data scientists develop most of their machine learning models by using Python, often
starting with Jupyter notebooks. It can be a challenge to implement these notebooks as
production code. Jupyter notebooks are more of an experimental tool, while Python
scripts are more appropriate for production. Teams often need to spend time refactoring
model creation code into Python scripts.
Make clients who are new to DevOps and machine learning aware that experimentation
and production require different rigor, so it's good practice to separate the two.
Tools like the Azure Machine Learning Visual Designer or AutoML can be effective in
getting basic models off the ground while the client ramps up on standard DevOps
practices to apply to the rest of the solution.
Azure DevOps has plug-ins that can integrate with Azure Machine Learning to help
trigger pipeline steps. The MLOpsPython repo has a few examples of such pipelines.
Machine learning often requires powerful GPU machines for training. If the client doesn't
already have such hardware available, Azure Machine Learning compute clusters can
provide an effective path for quickly provisioning cost-effective powerful hardware that
autoscales. If a client has advanced security or monitoring needs, there are other options
such as standard VMs, Databricks, or local compute.
For a client to be successful, their model building teams (data scientists) and deployment
teams (DevOps engineers) need to have a strong communication channel. They can
accomplish this with daily stand-up meetings or a formal online chat service. Both
approaches help in integrating their development efforts in an MLOps framework.

Data preparation considerations


The simplest solution for using Azure Machine Learning is to store data in a supported
data storage solution. Tools such as Azure Data Factory are effective for piping data to
and from those locations on a schedule.
It's important for clients to frequently capture additional retraining data to keep their
models up to date. If they don't already have a data pipeline, creating one will be an
important part of the overall solution. Using a solution such as Datasets in Azure Machine
Learning can be useful for versioning data to help with traceability of models.
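For example, a conditioned data drop could be registered as a versioned tabular dataset
roughly as follows; the datastore path and dataset name are assumptions for illustration.

Python

# Hypothetical registration of a versioned Azure Machine Learning dataset.
from azureml.core import Dataset, Datastore, Workspace

ws = Workspace.from_config()
datastore = Datastore.get(ws, datastore_name="workspaceblobstore")

dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, "conditioned/shipments/*.csv")   # assumed path to conditioned CSV files
)
dataset = dataset.register(
    workspace=ws,
    name="weekly-shipments",        # registering the same name again creates a new version
    create_new_version=True,
)
print(dataset.name, dataset.version)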

Model training and evaluation considerations


It's overwhelming for a client who is just getting started in their machine learning journey
to try to implement a full MLOps pipeline. If necessary, they can ease into it by using
Azure Machine Learning to track experiment runs and by using Azure Machine Learning
compute as the training target. These options lower the barrier to entry and provide a
starting point for integrating Azure services.

Going from a notebook experiment to repeatable scripts is a rough transition for many
data scientists. The sooner you can get them writing their training code in Python scripts,
the easier it will be for them to begin versioning their training code and to enable
retraining.
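A minimal sketch of such a script, assuming scikit-learn and azureml-core are available on
the training target; the data path, column names, and hyperparameter are placeholders.

Python

# train.py - a notebook experiment refactored into a parameterized script.
import argparse
import os

import joblib
import pandas as pd
from azureml.core import Run
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-path", default="conditioned_shipments.csv")  # placeholder
    parser.add_argument("--alpha", type=float, default=0.1)
    args = parser.parse_args()

    data = pd.read_csv(args.data_path)
    X, y = data.drop(columns=["units_shipped"]), data["units_shipped"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = Lasso(alpha=args.alpha).fit(X_train, y_train)

    # Log to the Azure Machine Learning run; falls back to an offline run locally.
    run = Run.get_context()
    run.log("alpha", args.alpha)
    run.log("r2_test", model.score(X_test, y_test))

    os.makedirs("outputs", exist_ok=True)     # 'outputs' is uploaded automatically by Azure ML
    joblib.dump(model, "outputs/model.pkl")


if __name__ == "__main__":
    main()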

Scripts aren't the only option: Databricks supports scheduling notebooks as jobs. But,
based on current client experience, this approach is difficult to instrument with full
DevOps practices because of testing limitations.

It's also important to understand what metrics are being used to consider a model a
success. Accuracy alone is often not good enough to determine the overall performance
of one model versus another.

Compute considerations
Customers should consider using containers to standardize their compute environments.
Nearly all Azure Machine Learning compute targets support using Docker. Having a
container handle the dependencies can reduce friction significantly, especially if the team
uses many compute targets.
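One way to do that with azureml-core is to build a registered environment from a conda
specification and a Docker base image, along these lines; the file name, image, and
environment name are assumptions.

Python

# Hypothetical standardized environment definition for all compute targets.
from azureml.core import Environment, Workspace

ws = Workspace.from_config()

env = Environment.from_conda_specification(
    name="forecasting-env",          # assumed environment name
    file_path="environment.yml",     # assumed conda specification checked into the repo
)
env.docker.base_image = "mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04"  # assumed base image
env.register(workspace=ws)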

Model serving considerations


The Azure Machine Learning SDK provides an option to deploy directly to Azure
Kubernetes Service (AKS) from a registered model, but this shortcut limits the security and
metrics controls that are in place. A simpler target can make it easier for clients to test a
model, but for production workloads it's best to develop a more robust deployment to AKS.
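For reference, a hedged sketch of such an AKS deployment with the v1 SDK follows. The model,
environment, entry script, and cluster names are placeholders, and a production deployment
would add further security and monitoring configuration.

Python

# Hypothetical AKS deployment of a registered model (not production-hardened).
from azureml.core import Environment, Workspace
from azureml.core.compute import AksCompute
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AksWebservice

ws = Workspace.from_config()
model = Model(ws, name="demand-forecaster")              # assumed registered model
env = Environment.get(ws, name="forecasting-env")        # assumed registered environment

inference_config = InferenceConfig(entry_script="score.py", environment=env)
deployment_config = AksWebservice.deploy_configuration(
    cpu_cores=1, memory_gb=2, auth_enabled=True, enable_app_insights=True
)
aks_target = AksCompute(ws, "aks-cluster")               # assumed AKS compute name

service = Model.deploy(
    workspace=ws,
    name="forecast-service",
    models=[model],
    inference_config=inference_config,
    deployment_config=deployment_config,
    deployment_target=aks_target,
)
service.wait_for_deployment(show_output=True)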

Next steps
Learn more about MLOps
MLOps on Azure
Azure Monitor Visualizations
Machine Learning Lifecycle
Azure DevOps Machine Learning extension
Azure Machine Learning CLI
Trigger applications, processes, or CI/CD workflows based on Azure Machine Learning
events
Set up model training and deployment with Azure DevOps
Set up MLOps with Azure Machine Learning and Databricks

Related resources
MLOps maturity model
Orchestrate MLOps on Azure Databricks using Databricks Notebook
MLOps for Python models using Azure Machine Learning
Data science and machine learning with Azure Databricks
Citizen AI with the Power Platform
Deploy AI and machine learning computing on-premises and to the edge
What is the Team Data Science
Process?

The Team Data Science Process (TDSP) is an agile, iterative data science methodology to
deliver predictive analytics solutions and intelligent applications efficiently. TDSP helps
improve team collaboration and learning by suggesting how team roles work best
together. TDSP includes best practices and structures from Microsoft and other industry
leaders to help toward successful implementation of data science initiatives. The goal is
to help companies fully realize the benefits of their analytics program.

This article provides an overview of TDSP and its main components. We provide a
generic description of the process here that can be implemented with different kinds of
tools. A more detailed description of the project tasks and roles involved in the lifecycle
of the process is provided in additional linked topics. We also provide guidance on how to
implement the TDSP by using the specific set of Microsoft tools and infrastructure that our
teams use.

Key components of the TDSP


TDSP has the following key components:

A data science lifecycle definition


A standardized project structure
Infrastructure and resources recommended for data science projects
Tools and utilities recommended for project execution

Data science lifecycle


The Team Data Science Process (TDSP) provides a lifecycle to structure the development
of your data science projects. The lifecycle outlines the full steps that successful projects
follow.

If you are using another data science lifecycle, such as CRISP-DM, KDD, or your
organization's own custom process, you can still use the task-based TDSP in the context
of those development lifecycles. At a high level, these different methodologies have
much in common.
This lifecycle has been designed for data science projects that ship as part of intelligent
applications. These applications deploy machine learning or artificial intelligence models
for predictive analytics. Exploratory data science projects or improvised analytics
projects can also benefit from using this process. But in such cases some of the steps
described may not be needed.

The lifecycle outlines the major stages that projects typically execute, often iteratively:

Business Understanding
Data Acquisition and Understanding
Modeling
Deployment

Here is a visual representation of the Team Data Science Process lifecycle.

The goals, tasks, and documentation artifacts for each stage of the lifecycle in TDSP are
described in the Team Data Science Process lifecycle topic. These tasks and artifacts are
associated with project roles:

Solution architect
Project manager
Data engineer
Data scientist
Application developer
Project lead

The following diagram provides a grid view of the tasks (in blue) and artifacts (in green)
associated with each stage of the lifecycle (on the horizontal axis) for these roles (on the
vertical axis).

Standardized project structure


Having all projects share a directory structure and use templates for project documents
makes it easy for the team members to find information about their projects. All code
and documents are stored in a version control system (VCS) like Git, TFS, or Subversion
to enable team collaboration. Tracking tasks and features in an agile project tracking
system like Jira, Rally, and Azure DevOps allows closer tracking of the code for individual
features. Such tracking also enables teams to obtain better cost estimates. TDSP
recommends creating a separate repository for each project on the VCS for versioning,
information security, and collaboration. The standardized structure for all projects helps
build institutional knowledge across the organization.

We provide templates for the folder structure and required documents in standard
locations. This folder structure organizes the files that contain code for data exploration
and feature extraction, and that record model iterations. These templates make it easier
for team members to understand work done by others and to add new members to
teams. It is easy to view and update document templates in markdown format. Use
templates to provide checklists with key questions for each project to ensure that the
problem is well defined and that deliverables meet the quality expected. Examples
include:

a project charter to document the business problem and scope of the project
data reports to document the structure and statistics of the raw data
model reports to document the derived features
model performance metrics such as ROC curves or MSE

The directory structure can be cloned from GitHub.

Infrastructure and resources for data science projects

TDSP provides recommendations for managing shared analytics and storage
infrastructure such as:

cloud file systems for storing datasets


databases
big data (SQL or Spark) clusters
machine learning service

The analytics and storage infrastructure, where raw and processed datasets are stored,
may be in the cloud or on-premises. This infrastructure enables reproducible analysis. It
also avoids duplication, which may lead to inconsistencies and unnecessary
infrastructure costs. Tools are provided to provision the shared resources, track them,
and allow each team member to connect to those resources securely. It is also a good
practice to have project members create a consistent compute environment. Different
team members can then replicate and validate experiments.

Here is an example of a team working on multiple projects and sharing various cloud
analytics infrastructure components.

Tools and utilities for project execution


Introducing processes in most organizations is challenging. Tools provided to implement
the data science process and lifecycle help lower the barriers to and increase the
consistency of their adoption. TDSP provides an initial set of tools and scripts to jump-
start adoption of TDSP within a team. It also helps automate some of the common tasks
in the data science lifecycle such as data exploration and baseline modeling. There is a
well-defined structure provided for individuals to contribute shared tools and utilities
into their team's shared code repository. These resources can then be leveraged by
other projects within the team or the organization. Microsoft provides extensive tooling
inside Azure Machine Learning that supports both open-source technologies (Python, R,
ONNX, and common deep-learning frameworks) and Microsoft's own tooling (AutoML).

Next steps
Team Data Science Process: Roles and tasks Outlines the key personnel roles and their
associated tasks for a data science team that standardizes on this process.
The Team Data Science Process lifecycle
Article • 11/15/2022

The Team Data Science Process (TDSP) provides a recommended lifecycle that you can
use to structure your data-science projects. The lifecycle outlines the complete steps
that successful projects follow. If you use another data-science lifecycle, such as the
Cross Industry Standard Process for Data Mining (CRISP-DM), Knowledge Discovery in
Databases (KDD), or your organization's own custom process, you can still use the
task-based TDSP.

This lifecycle is designed for data-science projects that are intended to ship as part of
intelligent applications. These applications deploy machine learning or artificial
intelligence models for predictive analytics. Exploratory data-science projects and
improvised analytics projects can also benefit from the use of this process. But for those
projects, some of the steps described here might not be needed.

Five lifecycle stages


The TDSP lifecycle is composed of five major stages that are executed iteratively. These
stages include:

1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance

Here is a visual representation of the TDSP lifecycle:


The TDSP lifecycle is modeled as a sequence of iterated steps that provide guidance on
the tasks needed to use predictive models. You deploy the predictive models in the
production environment that you plan to use to build the intelligent applications. The
goal of this process lifecycle is to continue to move a data-science project toward a clear
engagement end point. Data science is an exercise in research and discovery. The ability
to communicate tasks to your team and your customers by using a well-defined set of
artifacts that employ standardized templates helps to avoid misunderstandings. Using
these templates also increases the chance of the successful completion of a complex
data-science project.

For each stage, we provide the following information:

Goals: The specific objectives.


How to do it: An outline of the specific tasks and guidance on how to complete
them.
Artifacts: The deliverables and the support to produce them.

Next steps
For examples of how to execute steps in TDSPs that use Azure Machine Learning, see
Use the TDSP with Azure Machine Learning.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Related resources
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Machine learning at scale
The business understanding stage of the
Team Data Science Process lifecycle
Article • 11/15/2022

This article outlines the goals, tasks, and deliverables associated with the business
understanding stage of the Team Data Science Process (TDSP). This process provides a
recommended lifecycle that you can use to structure your data-science projects. The
lifecycle outlines the major stages that projects typically execute, often iteratively:

1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance

Here is a visual representation of the TDSP lifecycle:

Goals
Specify the key variables that are to serve as the model targets and whose related
metrics are used to determine the success of the project.
Identify the relevant data sources that the business has access to or needs to
obtain.

How to do it
There are two main tasks addressed in this stage:

Define objectives: Work with your customer and other stakeholders to understand
and identify the business problems. Formulate questions that define the business
goals that the data science techniques can target.
Identify data sources: Find the relevant data that helps you answer the questions
that define the objectives of the project.

Define objectives
1. A central objective of this step is to identify the key business variables that the
analysis needs to predict. We refer to these variables as the model targets, and we
use the metrics associated with them to determine the success of the project. Two
examples of such targets are sales forecasts or the probability of an order being
fraudulent.

2. Define the project goals by asking and refining "sharp" questions that are relevant,
specific, and unambiguous. Data science is a process that uses names and numbers
to answer such questions. You typically use data science or machine learning to
answer five types of questions:

How much or how many? (regression)


Which category? (classification)
Which group? (clustering)
Is this weird? (anomaly detection)
Which option should be taken? (recommendation)

Determine which of these questions you're asking and how answering it achieves
your business goals.

3. Define the project team by specifying the roles and responsibilities of its members.
Develop a high-level milestone plan that you iterate on as you discover more
information.
4. Define the success metrics. For example, you might want to predict customer churn
with an accuracy rate of "x" percent by the end of this three-month project, so that you
can offer customer promotions to reduce churn.
The metrics must be SMART:

Specific
Measurable
Achievable
Relevant
Time-bound

Identify data sources


Identify data sources that contain known examples of answers to your sharp questions.
Look for the following data:

Data that's relevant to the question. Do you have measures of the target and
features that are related to the target?
Data that's an accurate measure of your model target and the features of interest.

For example, you might find that the existing systems need to collect and log additional
kinds of data to address the problem and achieve the project goals. In this situation, you
might want to look for external data sources or update your systems to collect new data.

Artifacts
Here are the deliverables in this stage:

Charter document: A standard template is provided in the TDSP project structure
definition. The charter document is a living document. You update the template
throughout the project as you make new discoveries and as business requirements
change. The key is to iterate upon this document, adding more detail, as you
progress through the discovery process. Keep the customer and other stakeholders
involved in making the changes and clearly communicate the reasons for the
changes to them.
Data sources : The Raw data sources section of the Data definitions report that's
found in the TDSP project Data report folder contains the data sources. This
section specifies the original and destination locations for the raw data. In later
stages, you fill in additional details like the scripts to move the data to your
analytic environment.
Data dictionaries : This document provides descriptions of the data that's
provided by the client. These descriptions include information about the schema
(the data types and information on the validation rules, if any) and the entity-
relation diagrams, if available.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Here are links to each step in the lifecycle of the TDSP:

1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
Data acquisition and understanding
stage of the Team Data Science Process
Article • 11/15/2022

This article outlines the goals, tasks, and deliverables associated with the data
acquisition and understanding stage of the Team Data Science Process (TDSP). This
process provides a recommended lifecycle that you can use to structure your data-
science projects. The lifecycle outlines the major stages that projects typically execute,
often iteratively:

1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance

Here is a visual representation of the TDSP lifecycle:

Goals
Produce a clean, high-quality data set whose relationship to the target variables is
understood. Locate the data set in the appropriate analytics environment so you
are ready to model.
Develop a solution architecture of the data pipeline that refreshes and scores the
data regularly.

How to do it
There are three main tasks addressed in this stage:

Ingest the data into the target analytic environment.


Explore the data to determine if the data quality is adequate to answer the
question.
Set up a data pipeline to score new or regularly refreshed data.

Ingest the data


Set up the process to move the data from the source locations to the target locations
where you run analytics operations, like training and predictions. For technical details
and options on how to move the data with various Azure data services, see Load data
into storage environments for analytics.

Explore the data


Before you train your models, you need to develop a sound understanding of the data.
Real-world data sets are often noisy, are missing values, or have a host of other
discrepancies. You can use data summarization and visualization to audit the quality of
your data and provide the information you need to process the data before it's ready for
modeling. This process is often iterative. For guidance on cleaning the data, see Tasks to
prepare data for enhanced machine learning.

After you're satisfied with the quality of the cleansed data, the next step is to better
understand the patterns that are inherent in the data. This data analysis helps you
choose and develop an appropriate predictive model for your target. Look for evidence
for how well connected the data is to the target. Then determine whether there is
sufficient data to move forward with the next modeling steps. Again, this process is
often iterative. You might need to find new data sources with more accurate or more
relevant data to augment the data set initially identified in the previous stage.

Set up a data pipeline


In addition to the initial ingestion and cleaning of the data, you typically need to set up
a process to score new data or refresh the data regularly as part of an ongoing learning
process. Scoring may be completed with a data pipeline or workflow. The Move data
from a SQL Server instance to Azure SQL Database with Azure Data Factory article gives
an example of how to set up a pipeline with Azure Data Factory .

In this stage, you develop a solution architecture of the data pipeline. You develop the
pipeline in parallel with the next stage of the data science project. Depending on your
business needs and the constraints of your existing systems into which this solution is
being integrated, the pipeline can be one of the following options:

Batch-based
Streaming or real time
A hybrid

Artifacts
The following are the deliverables in this stage:

Data quality report : This report includes data summaries, the relationships
between each attribute and target, variable ranking, and more.
Solution architecture: The solution architecture can be a diagram or description of
your data pipeline that you use to run scoring or predictions on new data after you
have built a model. It also contains the pipeline to retrain your model based on
new data. Store the document in the Project directory when you use the TDSP
directory structure template.
Checkpoint decision: Before you begin full-feature engineering and model
building, you can reevaluate the project to determine whether the value expected
is sufficient to continue pursuing it. You might, for example, be ready to proceed,
need to collect more data, or abandon the project as the data does not exist to
answer the question.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.


Next steps
Here are links to each step in the lifecycle of the TDSP:

1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
Modeling stage of the Team Data
Science Process lifecycle
Article • 11/15/2022

This article outlines the goals, tasks, and deliverables associated with the modeling stage
of the Team Data Science Process (TDSP). This process provides a recommended
lifecycle that you can use to structure your data-science projects. The lifecycle outlines
the major stages that projects typically execute, often iteratively:

1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance

Here is a visual representation of the TDSP lifecycle:

Goals
Determine the optimal data features for the machine-learning model.
Create an informative machine-learning model that predicts the target most
accurately.
Create a machine-learning model that's suitable for production.

How to do it
There are three main tasks addressed in this stage:

Feature engineering: Create data features from the raw data to facilitate model
training.
Model training: Find the model that answers the question most accurately by
comparing the candidate models' success metrics.
Determine if your model is suitable for production.

Feature engineering
Feature engineering involves the inclusion, aggregation, and transformation of raw
variables to create the features used in the analysis. If you want insight into what is
driving a model, then you need to understand how the features relate to each other and
how the machine-learning algorithms are to use those features.

This step requires a creative combination of domain expertise and the insights obtained
from the data exploration step. Feature engineering is a balancing act of finding and
including informative variables, but at the same time trying to avoid too many unrelated
variables. Informative variables improve your result; unrelated variables introduce
unnecessary noise into the model. You also need to generate these features for any new
data obtained during scoring. As a result, the generation of these features can only
depend on data that's available at the time of scoring.

Model training
Depending on the type of question that you're trying to answer, there are many
modeling algorithms available. For guidance on choosing a prebuilt algorithm with
designer, see Machine Learning Algorithm Cheat Sheet for Azure Machine Learning
designer; other algorithms are available through open-source packages in R or Python.
Although this article focuses on Azure Machine Learning, the guidance it provides is
useful for any machine-learning projects.

The process for model training includes the following steps:


Split the input data randomly for modeling into a training data set and a test data
set.
Build the models by using the training data set.
Evaluate the models on the training and the test data sets. Use a series of competing
machine-learning algorithms along with the various associated tuning parameters (known
as a parameter sweep) that are geared toward answering the question of interest with
the current data.
Determine the "best" solution to answer the question by comparing the success
metrics between alternative methods.
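A compact, generic sketch of these steps with scikit-learn follows; the data file, column
names, algorithms, and parameter grids are placeholders rather than a prescribed setup.

Python

# Hypothetical model comparison with a random split and a parameter sweep.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

data = pd.read_csv("training_data.csv")                       # placeholder data set
X, y = data.drop(columns=["target"]), data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "ridge": GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]},
                          scoring="neg_root_mean_squared_error"),
    "forest": GridSearchCV(RandomForestRegressor(random_state=0),
                           {"n_estimators": [100, 300]},
                           scoring="neg_root_mean_squared_error"),
}

for name, search in candidates.items():
    search.fit(X_train, y_train)                              # sweep parameters on the training set
    print(name, search.best_params_, "test score:", search.score(X_test, y_test))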

See Train models with Azure Machine Learning for options on training models in Azure
Machine Learning.

Note

Avoid leakage: You can cause data leakage if you include data from outside the
training data set that allows a model or machine-learning algorithm to make
unrealistically good predictions. Leakage is a common reason why data scientists
get nervous when they get predictive results that seem too good to be true. These
dependencies can be hard to detect. To avoid leakage often requires iterating
between building an analysis data set, creating a model, and evaluating the
accuracy of the results.

Model Evaluation
After training, the data scientist focuses next on model evaluation.

Checkpoint decision: Evaluate whether the model performs sufficiently for
production. Some key questions to ask are:
Does the model answer the question with sufficient confidence given the test
data?
Should you try any alternative approaches?
Should you collect additional data, do more feature engineering, or experiment
with other algorithms?
Interpreting the Model: Use the Azure Machine Learning Python SDK to perform
the following tasks:
Explain the entire model behavior or individual predictions on your personal
machine locally.
Enable interpretability techniques for engineered features.
Explain the behavior for the entire model and individual predictions in Azure.
Upload explanations to Azure Machine Learning Run History.
Use a visualization dashboard to interact with your model explanations, both in
a Jupyter notebook and in the Azure Machine Learning workspace.
Deploy a scoring explainer alongside your model to observe explanations
during inferencing.
Assessing Fairness: The Fairlearn open-source Python package with Azure Machine
Learning performs the following tasks:
Assess the fairness of your model predictions. This process will help you learn
more about fairness in machine learning.
Upload, list, and download fairness assessment insights to/from Azure Machine
Learning studio.
See the fairness assessment dashboard in Azure Machine Learning studio to
interact with your model(s)' fairness insights.
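A minimal local sketch of that kind of assessment with the Fairlearn package follows; the
labels, predictions, and sensitive feature below are made-up placeholders.

Python

# Hypothetical group-fairness comparison with Fairlearn's MetricFrame.
import pandas as pd
from fairlearn.metrics import MetricFrame, selection_rate
from sklearn.metrics import accuracy_score

y_true = pd.Series([1, 0, 1, 1, 0, 1])                  # placeholder labels
y_pred = pd.Series([1, 0, 0, 1, 0, 1])                  # placeholder predictions
sensitive = pd.Series(["A", "A", "B", "B", "A", "B"])   # placeholder sensitive feature

frame = MetricFrame(
    metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sensitive,
)
print(frame.overall)        # metrics over the whole data set
print(frame.by_group)       # the same metrics per sensitive-feature group
print(frame.difference())   # largest between-group difference for each metric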

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Here are links to each step in the lifecycle of the TDSP:

1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
Deployment stage of the Team Data
Science Process lifecycle
Article • 11/15/2022

This article outlines the goals, tasks, and deliverables associated with the deployment of
the Team Data Science Process (TDSP). This process provides a recommended lifecycle
that you can use to structure your data-science projects. The lifecycle outlines the major
stages that projects typically execute, often iteratively:

1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance

Here is a visual representation of the TDSP lifecycle:

Goal
Deploy models with a data pipeline to a production or production-like environment for
final user acceptance.

How to do it
The main task addressed in this stage:

Operationalize the model: Deploy the model and pipeline to a production or
production-like environment for application consumption.

Operationalize a model
After you have a set of models that perform well, you can operationalize them for other
applications to consume. Depending on the business requirements, predictions are
made either in real time or on a batch basis. To deploy models, you expose them with an
open API interface. The interface enables the model to be easily consumed from various
applications, such as:

Online websites
Spreadsheets
Dashboards
Line-of-business applications
Back-end applications

For examples of model operationalization with Azure Machine Learning, see Deploy
machine learning models to Azure. It is a best practice to build telemetry and
monitoring into the production model and the data pipeline that you deploy. This
practice helps with subsequent system status reporting and troubleshooting.
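For example, when a model is exposed through an Azure Machine Learning web service, the
entry script typically follows an init/run pattern like the hedged sketch below; the model
file name and input schema are assumptions for illustration.

Python

# score.py - hypothetical entry script for a deployed model.
import json
import os

import joblib
import numpy as np

model = None


def init():
    """Called once when the service starts; load the registered model."""
    global model
    model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR", "."), "model.pkl")  # assumed file name
    model = joblib.load(model_path)


def run(raw_data: str) -> str:
    """Called per request; return predictions as JSON."""
    payload = json.loads(raw_data)             # expects {"data": [[...feature values...], ...]}
    features = np.array(payload["data"])
    predictions = model.predict(features)
    return json.dumps({"predictions": predictions.tolist()})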

Artifacts
A status dashboard that displays the system health and key metrics
A final modeling report with deployment details
A final solution architecture document

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:
Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Here are links to each step in the lifecycle of the TDSP:

1. Business understanding
2. Data Acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance

For Azure, we recommend applying TDSP by using Azure Machine Learning. For an
overview of Azure Machine Learning, see What is Azure Machine Learning?
Customer acceptance stage of the Team
Data Science Process lifecycle
Article • 11/15/2022

This article outlines the goals, tasks, and deliverables associated with the customer
acceptance stage of the Team Data Science Process (TDSP). This process provides a
recommended lifecycle that you can use to structure your data-science projects. The
lifecycle outlines the major stages that projects typically execute, often iteratively:

1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance

Here is a visual representation of the TDSP lifecycle:

Goal
Finalize project deliverables: Confirm that the pipeline, the model, and their
deployment in a production environment satisfy the customer's objectives.

How to do it
There are two main tasks addressed in this stage:

System validation: Confirm that the deployed model and pipeline meet the
customer's needs.
Project hand-off: Hand the project off to the entity that's going to run the system
in production.

The customer should validate that the system meets their business needs and that it
answers the questions with acceptable accuracy to deploy the system to production for
use by their client's application. All the documentation is finalized and reviewed. The
project is handed-off to the entity responsible for operations. This entity might be, for
example, an IT or customer data-science team or an agent of the customer that's
responsible for running the system in production.

Artifacts
The main artifact produced in this final stage is the Exit report of the project for the
customer. This technical report contains all the details of the project that are useful for
learning about how to operate the system. TDSP provides an Exit report template. You
can use the template as is, or you can customize it for specific client needs.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Here are links to each step in the lifecycle of the TDSP:
1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance

For Azure, we recommend applying TDSP by using Azure Machine Learning. For an
overview of Azure Machine Learning, see What is Azure Machine Learning?
Team Data Science Process roles and
tasks
Article • 12/06/2022

The Team Data Science Process (TDSP) is a framework developed by Microsoft that
provides a structured methodology to efficiently build predictive analytics solutions and
intelligent applications. This article outlines the key personnel roles and associated tasks
for a data science team standardizing on this process.

This introductory article links to tutorials on how to set up the TDSP environment. The
tutorials provide detailed guidance for using Azure DevOps Projects, Azure Repos
repositories, and Azure Boards. The motivating goal is moving from concept through
modeling and into deployment.

The tutorials use Azure DevOps because that is how TDSP is implemented at Microsoft.
Azure DevOps facilitates collaboration by integrating role-based security, work item
management and tracking, and code hosting, sharing, and source control. The tutorials
also use an Azure Data Science Virtual Machine (DSVM) as the analytics desktop,
which has several popular data science tools pre-configured and integrated with
Microsoft software and Azure services.

You can use the tutorials to implement TDSP using other code-hosting, agile planning,
and development tools and environments, but some features may not be available.

Structure of data science groups and teams


Data science functions in enterprises are often organized in the following hierarchy:

Data science group


Data science team/s within the group

In such a structure, there are group leads and team leads. Typically, a data science
project is done by a data science team. Data science teams have project leads for
project management and governance tasks, and individual data scientists and engineers
to perform the data science and data engineering parts of the project. The initial project
setup and governance is done by the group, team, or project leads.

Definition and tasks for the four TDSP roles


With the assumption that the data science unit consists of teams within a group, there
are four distinct roles for TDSP personnel:

1. Group Manager: Manages the entire data science unit in an enterprise. A data
science unit might have multiple teams, each of which is working on multiple data
science projects in distinct business verticals. A Group Manager might delegate
their tasks to a surrogate, but the tasks associated with the role do not change.

2. Team Lead: Manages a team in the data science unit of an enterprise. A team
consists of multiple data scientists. For a small data science unit, the Group
Manager and the Team Lead might be the same person.

3. Project Lead: Manages the daily activities of individual data scientists on a specific
data science project.

4. Project Individual Contributors: Data Scientists, Business Analysts, Data Engineers,


Architects, and others who execute a data science project.

Note

Depending on the structure and size of an enterprise, a single person may play
more than one role, or more than one person may fill a role.

Tasks to be completed by the four roles


The following diagram shows the top-level tasks for each Team Data Science Process
role. This schema and the following, more detailed outline of tasks for each TDSP role
can help you choose the tutorial you need based on your responsibilities.
Group Manager tasks
The Group Manager or a designated TDSP system administrator completes the
following tasks to adopt the TDSP:

Creates an Azure DevOps organization and a group project within the
organization.
Creates a project template repository in the Azure DevOps group project, and
seeds it from the project template repository developed by the Microsoft TDSP
team. The Microsoft TDSP project template repository provides:
A standardized directory structure, including directories for data, code, and
documents.
A set of standardized document templates to guide an efficient data science
process.
Creates a utility repository, and seeds it from the utility repository developed by
the Microsoft TDSP team. The TDSP utility repository from Microsoft provides a set
of useful utilities to make the work of a data scientist more efficient. The Microsoft
utility repository includes utilities for interactive data exploration, analysis,
reporting, and baseline modeling and reporting.
Sets up the security control policy for the organization account.

For detailed instructions, see Group Manager tasks for a data science team.
Team Lead tasks
The Team Lead or a designated project administrator completes the following tasks to
adopt the TDSP:

Creates a team project in the group's Azure DevOps organization.


Creates the project template repository in the project, and seeds it from the
group project template repository set up by the Group Manager or delegate.
Creates the team utility repository, seeds it from the group utility repository, and
adds team-specific utilities to the repository.
Optionally creates Azure file storage to store useful data assets for the team.
Other team members can mount this shared cloud file store on their analytics
desktops.
Optionally mounts the Azure file storage on the team's DSVM and adds team data
assets to it.
Sets up security control by adding team members and configuring their
permissions.

For detailed instructions, see Team Lead tasks for a data science team.

Project Lead tasks


The Project Lead completes the following tasks to adopt the TDSP:

Creates a project repository in the team project, and seeds it from the project
template repository.
Optionally creates Azure file storage to store the project's data assets.
Optionally mounts the Azure file storage to the DSVM and adds project data
assets to it.
Sets up security control by adding project members and configuring their
permissions.

For detailed instructions, see Project Lead tasks for a data science team.

Project Individual Contributor tasks


The Project Individual Contributor, usually a Data Scientist, conducts the following tasks
using the TDSP:

Clones the project repository set up by the project lead.


Optionally mounts the shared team and project Azure file storage on their Data
Science Virtual Machine (DSVM).
Executes the project.

For detailed instructions for onboarding onto a project, see Project Individual
Contributor tasks for a data science team.

Data science project execution workflow


By following the relevant tutorials, data scientists, project leads, and team leads can
create work items to track all tasks and stages for a project from beginning to end. Using
Azure Repos promotes collaboration among data scientists and ensures that the
artifacts generated during project execution are version controlled and shared by all
project members. Azure DevOps lets you link your Azure Boards work items with your
Azure Repos repository branches and easily track what has been done for a work item.

The following figure outlines the TDSP workflow for project execution:

The workflow steps can be grouped into three activities:

Project Leads conduct sprint planning


Data Scientists develop artifacts on git branches to address work items
Project Leads or other team members do code reviews and merge working
branches to the primary branch

For detailed instructions on project execution workflow, see Agile development of data
science projects.

TDSP project template repository


Use the Microsoft TDSP team's project template repository to support efficient project
execution and collaboration. The repository gives you a standardized directory structure
and document templates you can use for your own TDSP projects.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Explore more detailed descriptions of the roles and tasks defined by the Team Data
Science Process:

Group Manager tasks for a data science team


Team Lead tasks for a data science team
Project Lead tasks for a data science team
Project Individual Contributor tasks for a data science team

Related resources
Team Data Science Process group manager tasks
Tasks for the team lead on a Team Data Science Process team
Project lead tasks in the Team Data Science Process
Tasks for an individual contributor in the Team Data Science Process
Team Data Science Process group
manager tasks
Article • 11/15/2022

This article describes the tasks that a group manager completes for a data science
organization. The group manager manages the entire data science unit in an enterprise.
A data science unit may have several teams, each of which is working on many data
science projects in distinct business verticals. The group manager's objective is to
establish a collaborative group environment that standardizes on the Team Data Science
Process (TDSP). For an outline of all the personnel roles and associated tasks handled by
a data science team standardizing on the TDSP, see Team Data Science Process roles
and tasks.

The following diagram shows the six main group manager setup tasks. Group managers
may delegate their tasks to surrogates, but the tasks associated with the role don't
change.

1. Set up an Azure DevOps organization for the group.


2. Create the default GroupCommon project in the Azure DevOps organization.
3. Create the GroupProjectTemplate repository in Azure Repos.
4. Create the GroupUtilities repository in Azure Repos.
5. Import the contents of the Microsoft TDSP team's ProjectTemplate and Utilities
repositories into the group common repositories.
6. Set up membership and permissions for team members to access the group.

The following tutorial walks through the steps in detail.

Note

This article uses Azure DevOps to set up a TDSP group environment, because that is
how TDSP is implemented at Microsoft. If your group uses other code hosting or
development platforms, the Group Manager's tasks are the same, but the way to
complete them may be different.
Create an organization and project in Azure
DevOps
1. Go to visualstudio.microsoft.com , select Sign in at upper right, and sign into
your Microsoft account.

If you don't have a Microsoft account, select Sign up now, create a Microsoft
account, and sign in using this account. If your organization has a Visual Studio
subscription, sign in with the credentials for that subscription.

2. After you sign in, at upper right on the Azure DevOps page, select Create new
organization.

3. If you're prompted to agree to the Terms of Service, Privacy Statement, and Code
of Conduct, select Continue.

4. In the signup dialog, name your Azure DevOps organization and accept the host
region assignment, or drop down and select a different region. Then select
Continue.

5. Under Create a project to get started, enter GroupCommon, and then select
Create project.

The GroupCommon project Summary page opens. The page URL is
https://<servername>/<organization-name>/GroupCommon.

Set up the group common repositories


Azure Repos hosts the following types of repositories for your group:

Group common repositories: General-purpose repositories that multiple teams
within a data science unit can adopt for many data science projects.
Team repositories: Repositories for specific teams within a data science unit. These
repositories are specific for a team's needs, and may be used for multiple projects
within that team, but are not general enough to be used across multiple teams
within a data science unit.
Project repositories: Repositories for specific projects. Such repositories may not
be general enough for multiple projects within a team, or for other teams in a data
science unit.

To set up the group common repositories in your project, you:

Rename the default GroupCommon repository to GroupProjectTemplate


Create a new GroupUtilities repository

Rename the default project repository to GroupProjectTemplate

To rename the default GroupCommon project repository to GroupProjectTemplate:

1. On the GroupCommon project Summary page, select Repos. This action takes you
to the default GroupCommon repository of the GroupCommon project, which is
currently empty.

2. At the top of the page, drop down the arrow next to GroupCommon and select
Manage repositories.

3. On the Project Settings page, select the ... next to GroupCommon, and then select
Rename repository.
4. In the Rename the GroupCommon repository popup, enter GroupProjectTemplate,
and then select Rename.

Create the GroupUtilities repository


To create the GroupUtilities repository:

1. On the GroupCommon project Summary page, select Repos.

2. At the top of the page, drop down the arrow next to GroupProjectTemplate and
select New repository.
3. In the Create a new repository dialog, select Git as the Type, enter GroupUtilities
as the Repository name, and then select Create.

4. On the Project Settings page, select Repositories under Repos in the left
navigation to see the two group repositories: GroupProjectTemplate and
GroupUtilities.
Import the Microsoft TDSP team repositories
In this part of the tutorial, you import the contents of the ProjectTemplate and Utilities
repositories managed by the Microsoft TDSP team into your GroupProjectTemplate and
GroupUtilities repositories.

To import the TDSP team repositories:

1. From the GroupCommon project home page, select Repos in the left navigation.
The default GroupProjectTemplate repo opens.

2. On the GroupProjectTemplate is empty page, select Import.


3. In the Import a Git repository dialog, select Git as the Source type, and enter
https://github.com/Azure/Azure-TDSP-ProjectTemplate.git for the Clone URL. Then
select Import. The contents of the Microsoft TDSP team ProjectTemplate
repository are imported into your GroupProjectTemplate repository.

4. At the top of the Repos page, drop down and select the GroupUtilities repository.

Each of your two group repositories now contains all the files, except those in the .git
directory, from the Microsoft TDSP team's corresponding repository.
Customize the contents of the group
repositories
If you want to customize the contents of your group repositories to meet the specific
needs of your group, you can do that now. You can modify the files, change the
directory structure, or add files that your group has developed or that are helpful for
your group.

Make changes in Azure Repos


To customize repository contents:

1. On the GroupCommon project Summary page, select Repos.

2. At the top of the page, select the repository you want to customize.

3. In the repo directory structure, navigate to the folder or file you want to change.

To create new folders or files, select the arrow next to New.

To upload files, select Upload file(s).


To edit existing files, navigate to the file and then select Edit.

4. After adding or editing files, select Commit.


Make changes using your local machine or DSVM
If you want to make changes using your local machine or DSVM and push the changes
up to the group repositories, make sure you have the prerequisites for working with Git
and DSVMs:

An Azure subscription, if you want to create a DSVM.


Git installed on your machine. If you're using a DSVM, Git is pre-installed.
Otherwise, see the Platforms and tools appendix.
If you want to use a DSVM, the Windows or Linux DSVM created and configured in
Azure. For more information and instructions, see the Data Science Virtual Machine
Documentation.
For a Windows DSVM, Git Credential Manager (GCM) installed on your machine.
In the README.md file, scroll down to the Download and Install section and select
the latest installer. Download the .exe installer from the installer page and run it.
For a Linux DSVM, an SSH public key set up on your DSVM and added in Azure
DevOps. For more information and instructions, see the Create SSH public key
section in the Platforms and tools appendix.

First, copy or clone the repository to your local machine.

1. On the GroupCommon project Summary page, select Repos, and at the top of the
page, select the repository you want to clone.
2. On the repo page, select Clone at upper right.

3. In the Clone repository dialog, select HTTPS for an HTTP connection, or SSH for
an SSH connection, and copy the clone URL under Command line to your
clipboard.

4. On your local machine, create the following directories:

For Windows: C:\GitRepos\GroupCommon


For Linux: $home/GitRepos/GroupCommon in your home directory

5. Change to the directory you created.

6. In Git Bash, run the command git clone <clone URL>.

For example, either of the following commands clones the GroupUtilities repository to the GroupCommon directory on your local machine.

HTTPS connection:

Bash

git clone https://DataScienceUnit@dev.azure.com/DataScienceUnit/GroupCommon/_git/GroupUtilities

SSH connection:

Bash

git clone git@ssh.dev.azure.com:v3/DataScienceUnit/GroupCommon/GroupUtilities

After making whatever changes you want in the local clone of your repository, you can
push the changes to the shared group common repositories.

Run the following Git Bash commands from your local GroupProjectTemplate or
GroupUtilities directory.

Bash

git add .
git commit -m "push from local"
git push

Note

If this is the first time you commit to a Git repository, you may need to configure
global parameters user.name and user.email before you run the git commit
command. Run the following two commands:

git config --global user.name <your name>

git config --global user.email <your email address>

If you're committing to several Git repositories, use the same name and email
address for all of them. Using the same name and email address is convenient
when building Power BI dashboards to track your Git activities in multiple
repositories.

Add group members and configure permissions
To add members to the group:

1. In Azure DevOps, from the GroupCommon project home page, select Project
settings from the left navigation.

2. From the Project Settings left navigation, select Teams, then on the Teams page,
select the GroupCommon Team.
3. On the Team Profile page, select Add.

4. In the Add users and groups dialog, search for and select members to add to the
group, and then select Save changes.
To configure permissions for members:

1. From the Project Settings left navigation, select Permissions.

2. On the Permissions page, select the group you want to add members to.

3. On the page for that group, select Members, and then select Add.

4. In the Invite members popup, search for and select members to add to the group,
and then select Save.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Here are links to detailed descriptions of the other roles and tasks in the Team Data
Science Process:

Team Lead tasks for a data science team


Project Lead tasks for a data science team
Project Individual Contributor tasks for a data science team
Tasks for the team lead on a Team Data
Science Process team
Article • 11/15/2022

This article describes the tasks that a team lead completes for their data science team.
The team lead's objective is to establish a collaborative team environment that
standardizes on the Team Data Science Process (TDSP). The TDSP is designed to help
improve collaboration and team learning.

The TDSP is an agile, iterative data science methodology to efficiently deliver predictive
analytics solutions and intelligent applications. The process distills the best practices and
structures from Microsoft and the industry. The goal is successful implementation of
data science initiatives and fully realizing the benefits of their analytics programs. For an
outline of the personnel roles and associated tasks for a data science team
standardizing on the TDSP, see Team Data Science Process roles and tasks.

A team lead manages a team consisting of several data scientists in the data science unit
of an enterprise. Depending on the data science unit's size and structure, the group
manager and the team lead might be the same person, or they could delegate their
tasks to surrogates. But the tasks themselves do not change.

The following diagram shows the workflow for the tasks the team lead completes to set
up a team environment:

1. Create a team project in the group's organization in Azure DevOps.

2. Rename the default team repository to TeamUtilities.

3. Create a new TeamTemplate repository in the team project.

4. Import the contents of the group's GroupUtilities and GroupProjectTemplate repositories into the TeamUtilities and TeamTemplate repositories.

5. Set up security control by adding team members and configuring their permissions.
6. If required, create team data and analytics resources:

Add team-specific utilities to the TeamUtilities repository.


Create Azure file storage to store data assets that can be useful for the entire
team.
Mount the Azure file storage to the team lead's Data Science Virtual
Machine (DSVM) and add data assets to it.

The following tutorial walks through the steps in detail.

Note

This article uses Azure DevOps and a DSVM to set up a TDSP team environment,
because that is how to implement TDSP at Microsoft. If your team uses other code
hosting or development platforms, the team lead tasks are the same, but the way
to complete them may be different.

Prerequisites
This tutorial assumes that the following resources and permissions have been set up by
your group manager:

The Azure DevOps organization for your data unit


GroupProjectTemplate and GroupUtilities repositories, populated with the
contents of the Microsoft TDSP team's ProjectTemplate and Utilities repositories
Permissions on your organization account for you to create projects and
repositories for your team

To be able to clone repositories and modify their content on your local machine or
DSVM, or set up Azure file storage and mount it to your DSVM, you need the following:

An Azure subscription.
Git installed on your machine. If you're using a DSVM, Git is pre-installed.
Otherwise, see the Platforms and tools appendix.
If you want to use a DSVM, the Windows or Linux DSVM created and configured in
Azure. For more information and instructions, see the Data Science Virtual Machine
Documentation.
For a Windows DSVM, Git Credential Manager (GCM) installed on your machine.
In the README.md file, scroll down to the Download and Install section and select
the latest installer. Download the .exe installer from the installer page and run it.
For a Linux DSVM, an SSH public key set up on your DSVM and added in Azure
DevOps. For more information and instructions, see the Create SSH public key
section in the Platforms and tools appendix.

Create a team project and repositories


In this section, you create the following resources in your group's Azure DevOps
organization:

The MyTeam project in Azure DevOps


The TeamTemplate repository
The TeamUtilities repository

The names specified for the repositories and directories in this tutorial assume that you
want to establish a separate project for your own team within your larger data science
organization. However, the entire group can choose to work under a single project
created by the group manager or organization administrator. Then, all the data science
teams create repositories under this single project. This scenario might be valid for:

A small data science group that doesn't have multiple data science teams.
A larger data science group with multiple data science teams that nevertheless
wants to optimize inter-team collaboration with activities such as group-level
sprint planning.

If teams choose to have their team-specific repositories under a single group project,
the team leads should create the repositories with names like <TeamName>Template
and <TeamName>Utilities. For instance: TeamATemplate and TeamAUtilities.

In any case, team leads need to let their team members know which template and
utilities repositories to set up and clone. Project leads should follow the project lead
tasks for a data science team to create project repositories, whether under separate
projects or a single project.

Create the MyTeam project


To create a separate project for your team:

1. In your web browser, go to your group's Azure DevOps organization home page at
URL https://<server name>/<organization name>, and select New project.
2. In the Create project dialog, enter your team name, such as MyTeam, under
Project name, and then select Advanced.

3. Under Version control, select Git, and under Work item process, select Agile. Then
select Create.
The team project Summary page opens, with page URL https://<server
name>/<organization name>/<team name>.

Rename the MyTeam default repository to TeamUtilities


1. On the MyTeam project Summary page, under What service would you like to
start with?, select Repos.
2. On the MyTeam repo page, select the MyTeam repository at the top of the page,
and then select Manage repositories from the dropdown.

3. On the Project Settings page, select the ... next to the MyTeam repository, and
then select Rename repository.
4. In the Rename the MyTeam repository popup, enter TeamUtilities, and then select
Rename.

Create the TeamTemplate repository


1. On the Project Settings page, select New repository.

Or, select Repos from the left navigation of the MyTeam project Summary page,
select a repository at the top of the page, and then select New repository from the
dropdown.
2. In the Create a new repository dialog, make sure Git is selected under Type. Enter
TeamTemplate under Repository name, and then select Create.

3. Confirm that you can see the two repositories TeamUtilities and TeamTemplate on
your project settings page.

Import the contents of the group common repositories


To populate your team repositories with the contents of the group common repositories
set up by your group manager:
1. From your MyTeam project home page, select Repos in the left navigation. If you get a message that the MyTeam template is not found, select the link in the message. Otherwise, navigate to your default TeamTemplate repository.

The default TeamTemplate repository opens.

2. On the TeamTemplate is empty page, select Import.

3. In the Import a Git repository dialog, select Git as the Source type, and enter the
URL for your group common template repository under Clone URL. The URL is
https://<server name>/<organization name>/_git/<repository name>. For example:
https://dev.azure.com/DataScienceUnit/GroupCommon/_git/GroupProjectTemplate.

4. Select Import. The contents of your group template repository are imported into
your team template repository.
5. At the top of your project's Repos page, drop down and select the TeamUtilities
repository.

6. Repeat the import process to import the contents of your group common utilities
repository, for example GroupUtilities, into your TeamUtilities repository.

Each of your two team repositories now contains the files from the corresponding group
common repository.

Customize the contents of the team repositories


If you want to customize the contents of your team repositories to meet your team's
specific needs, you can do that now. You can modify files, change the directory
structure, or add files and folders.

To modify, upload, or create files or folders directly in Azure DevOps:

1. On the MyTeam project Summary page, select Repos.

2. At the top of the page, select the repository you want to customize.

3. In the repo directory structure, navigate to the folder or file you want to change.

To create new folders or files, select the arrow next to New.


To upload files, select Upload file(s).

To edit existing files, navigate to the file and then select Edit.
4. After adding or editing files, select Commit.

To work with repositories on your local machine or DSVM, you first copy or clone the
repositories to your local machine, and then commit and push your changes up to the
shared team repositories.

To clone repositories:

1. On the MyTeam project Summary page, select Repos, and at the top of the page,
select the repository you want to clone.

2. On the repo page, select Clone at upper right.

3. In the Clone repository dialog, under Command line, select HTTPS for an HTTP
connection or SSH for an SSH connection, and copy the clone URL to your
clipboard.
4. On your local machine, create the following directories:

For Windows: C:\GitRepos\MyTeam


For Linux, $home/GitRepos/MyTeam

5. Change to the directory you created.

6. In Git Bash, run the command git clone <clone URL> , where <clone URL> is the
URL you copied from the Clone dialog.

For example, use one of the following commands to clone the TeamUtilities
repository to the MyTeam directory on your local machine.

HTTPS connection:

Bash

git clone https://DataScienceUnit@dev.azure.com/DataScienceUnit/MyTeam/_git/TeamUtilities

SSH connection:

Bash

git clone git@ssh.dev.azure.com:v3/DataScienceUnit/MyTeam/TeamUtilities

After making whatever changes you want in the local clone of your repository, commit
and push the changes to the shared team repositories.
Run the following Git Bash commands from your local
GitRepos\MyTeam\TeamTemplate or GitRepos\MyTeam\TeamUtilities directory.

Bash

git add .
git commit -m "push from local"
git push

Note

If this is the first time you commit to a Git repository, you may need to configure
global parameters user.name and user.email before you run the git commit
command. Run the following two commands:

git config --global user.name <your name>

git config --global user.email <your email address>

If you're committing to several Git repositories, use the same name and email
address for all of them. Using the same name and email address is convenient
when building Power BI dashboards to track your Git activities in multiple
repositories.

Add team members and configure permissions


To add members to the team:

1. In Azure DevOps, from the MyTeam project home page, select Project settings
from the left navigation.

2. From the Project Settings left navigation, select Teams, then on the Teams page,
select the MyTeam Team.
3. On the Team Profile page, select Add.

4. In the Add users and groups dialog, search for and select members to add to the
group, and then select Save changes.

To configure permissions for team members:

1. From the Project Settings left navigation, select Permissions.

2. On the Permissions page, select the group you want to add members to.

3. On the page for that group, select Members, and then select Add.
4. In the Invite members popup, search for and select members to add to the group,
and then select Save.

Create team data and analytics resources


This step is optional, but sharing data and analytics resources with your entire team has
performance and cost benefits. Team members can execute their projects on the shared
resources, save on budgets, and collaborate more efficiently. You can create Azure file
storage and mount it on your DSVM to share with team members.

For information about sharing other resources with your team, such as Azure HDInsight
Spark clusters, see Platforms and tools. That topic provides guidance from a data
science perspective on selecting resources that are appropriate for your needs, and links
to product pages and other relevant and useful tutorials.

Note

To avoid transmitting data across data centers, which might be slow and costly,
make sure that your Azure resource group, storage account, and DSVM are all
hosted in the same Azure region.

Create Azure file storage


1. Run the following script to create Azure file storage for data assets that are useful
for your entire team. The script prompts you for your Azure subscription
information, so have that ready to enter.

On a Windows machine, run the script from the PowerShell command prompt:

PowerShell

wget "https://raw.githubusercontent.com/Azure/Azure-MachineLearning-DataScience/master/Misc/TDSP/CreateFileShare.ps1" -outfile "CreateFileShare.ps1"
.\CreateFileShare.ps1

On a Linux machine, run the script from the Linux shell:

shell

wget "https://raw.githubusercontent.com/Azure/Azure-
MachineLearning-DataScience/master/Misc/TDSP/CreateFileShare.sh"
bash CreateFileShare.sh

2. Log in to your Microsoft Azure account when prompted, and select the
subscription you want to use.

3. Select the storage account to use, or create a new one under your selected
subscription. You can use lowercase characters, numbers, and hyphens for the
Azure file storage name.

4. To facilitate mounting and sharing the storage, press Enter or enter Y to save the
Azure file storage information into a text file in your current directory. You can
check in this text file to your TeamTemplate repository, ideally under
Docs\DataDictionaries, so all projects in your team can access it. You also need the
file information to mount your Azure file storage to your Azure DSVM in the next
section.

Mount Azure file storage on your local machine or DSVM


1. To mount your Azure file storage to your local machine or DSVM, use the following
script.

On a Windows machine, run the script from the PowerShell command prompt:

PowerShell

wget "https://raw.githubusercontent.com/Azure/Azure-
MachineLearning-DataScience/master/Misc/TDSP/AttachFileShare.ps1"
-outfile "AttachFileShare.ps1"
.\AttachFileShare.ps1

On a Linux machine, run the script from the Linux shell:


shell

wget "https://raw.githubusercontent.com/Azure/Azure-
MachineLearning-DataScience/master/Misc/TDSP/AttachFileShare.sh"
bash AttachFileShare.sh

2. Press Enter or enter Y to continue, if you saved an Azure file storage information
file in the previous step. Enter the complete path and name of the file you created.

If you don't have an Azure file storage information file, enter n, and follow the
instructions to enter your subscription, Azure storage account, and Azure file
storage information.

3. Enter the name of a local or TDSP drive to mount the file share on. The screen
displays a list of existing drive names. Provide a drive name that doesn't already
exist.

4. Confirm that the new drive and storage is successfully mounted on your machine.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Here are links to detailed descriptions of the other roles and tasks defined by the Team
Data Science Process:

Group Manager tasks for a data science team


Project Lead tasks for a data science team
Project Individual Contributor tasks for a data science team
Project lead tasks in the Team Data
Science Process
Article • 11/15/2022

This article describes tasks that a project lead completes to set up a repository for their
project team in the Team Data Science Process (TDSP). The TDSP is a framework
developed by Microsoft that provides a structured sequence of activities to efficiently
execute cloud-based, predictive analytics solutions. The TDSP is designed to help
improve collaboration and team learning. For an outline of the personnel roles and
associated tasks for a data science team standardizing on the TDSP, see Team Data
Science Process roles and tasks.

A project lead manages the daily activities of individual data scientists on a specific data
science project in the TDSP. The following diagram shows the workflow for project lead
tasks:

This tutorial covers Step 1: Create project repository, and Step 2: Seed project repository
from your team ProjectTemplate repository.

For Step 3: Create Feature work item for project, and Step 4: Add Stories for project
phases, see Agile development of data science projects.

For Step 5: Create and customize storage/analysis assets and share, if necessary, see
Create team data and analytics resources.

For Step 6: Set up security control of project repository, see Add team members and
configure permissions.

Note

This article uses Azure Repos to set up a TDSP project, because that is how to
implement TDSP at Microsoft. If your team uses another code hosting platform, the
project lead tasks are the same, but the way to complete them may be different.
Prerequisites
This tutorial assumes that your group manager and team lead have set up the following
resources and permissions:

The Azure DevOps organization for your data unit


A team project for your data science team
Team template and utilities repositories
Permissions on your organization account for you to create and edit repositories
for your project

To clone repositories and modify content on your local machine or Data Science Virtual
Machine (DSVM), or set up Azure file storage and mount it to your DSVM, you also need
to consider this checklist:

An Azure subscription.
Git installed on your machine. If you're using a DSVM, Git is pre-installed.
Otherwise, see the Platforms and tools appendix.
If you want to use a DSVM, the Windows or Linux DSVM created and configured in
Azure. For more information and instructions, see the Data Science Virtual Machine
Documentation.
For a Windows DSVM, Git Credential Manager (GCM) installed on your machine.
In the README.md file, scroll down to the Download and Install section and select
the latest installer. Download the .exe installer from the installer page and run it.
For a Linux DSVM, an SSH public key set up on your DSVM and added in Azure
DevOps. For more information and instructions, see the Create SSH public key
section in the Platforms and tools appendix.

Create a project repository in your team project


To create a project repository in your team's MyTeam project:

1. Go to your team's project Summary page at https://<server name>/<organization name>/<team name>, for example, https://dev.azure.com/DataScienceUnit/MyTeam, and select Repos from the left navigation.

2. Select the repository name at the top of the page, and then select New repository
from the dropdown.
3. In the Create a new repository dialog, make sure Git is selected under Type. Enter
DSProject1 under Repository name, and then select Create.

4. Confirm that you can see the new DSProject1 repository on your project settings
page.
Import the team template into your project
repository
To populate your project repository with the contents of your team template repository:

1. From your team's project Summary page, select Repos in the left navigation.

2. Select the repository name at the top of the page, and select DSProject1 from the
dropdown.

3. On the DSProject1 is empty page, select Import.


4. In the Import a Git repository dialog, select Git as the Source type, and enter the
URL for your TeamTemplate repository under Clone URL. The URL is
https://<server name>/<organization name>/<team name>/_git/<team template
repository name>. For example:
https://dev.azure.com/DataScienceUnit/MyTeam/_git/TeamTemplate.

5. Select Import. The contents of your team template repository are imported into
your project repository.
If you need to customize the contents of your project repository to meet your project's
specific needs, you can add, delete, or modify repository files and folders. You can work
directly in Azure Repos, or clone the repository to your local machine or DSVM, make
changes, and commit and push your updates to the shared project repository. Follow
the instructions at Customize the contents of the team repositories.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Here are links to detailed descriptions of the other roles and tasks defined by the Team
Data Science Process:

Group Manager tasks for a data science team


Team Lead tasks for a data science team
Individual Contributor tasks for a data science team
Tasks for an individual contributor in the
Team Data Science Process
Article • 11/15/2022

This topic outlines the tasks that an individual contributor completes to set up a project
in the Team Data Science Process (TDSP). The objective is to work in a collaborative
team environment that standardizes on the TDSP. The TDSP is designed to help improve
collaboration and team learning. For an outline of the personnel roles and their
associated tasks that are handled by a data science team standardizing on the TDSP, see
Team Data Science Process roles and tasks.

The following diagram shows the tasks that project individual contributors (data
scientists) complete to set up their team environment. For instructions on how to
execute a data science project under the TDSP, see Execution of data science projects.

ProjectRepository is the repository your project team maintains to share project templates and assets.
TeamUtilities is the utilities repository your team maintains specifically for your
team.
GroupUtilities is the repository your group maintains to share useful utilities across
the entire group.

Note

This article uses Azure Repos and a Data Science Virtual Machine (DSVM) to set up
a TDSP environment, because that is how to implement TDSP at Microsoft. If your
team uses other code hosting or development platforms, the individual contributor
tasks are the same, but the way to complete them may be different.

Prerequisites
This tutorial assumes that the following resources and permissions have been set up by
your group manager, team lead, and project lead:

The Azure DevOps organization for your data science unit


A project repository set up by your project lead to share project templates and
assets
GroupUtilities and TeamUtilities repositories set up by the group manager and
team lead, if applicable
Azure file storage set up for shared assets for your team or project, if applicable
Permissions for you to clone from and push back to your project repository

To clone repositories and modify content on your local machine or DSVM, or mount
Azure file storage to your DSVM, you need to consider this checklist:

An Azure subscription.
Git installed on your machine. If you're using a DSVM, Git is pre-installed.
Otherwise, see the Platforms and tools appendix.
If you want to use a DSVM, the Windows or Linux DSVM created and configured in
Azure. For more information and instructions, see the Data Science Virtual Machine
Documentation.
For a Windows DSVM, Git Credential Manager (GCM) installed on your machine.
In the README.md file, scroll down to the Download and Install section and select
the latest installer. Download the .exe installer from the installer page and run it.
For a Linux DSVM, an SSH public key set up on your DSVM and added in Azure
DevOps. For more information and instructions, see the Create SSH public key
section in the Platforms and tools appendix.
The Azure file storage information for any Azure file storage you need to mount to
your DSVM.

Clone repositories
To work with repositories locally and push your changes up to the shared team and
project repositories, you first copy or clone the repositories to your local machine.

1. In Azure DevOps, go to your team's project Summary page at https://<server name>/<organization name>/<team name>, for example, https://dev.azure.com/DataScienceUnit/MyTeam.

2. Select Repos in the left navigation, and at the top of the page, select the repository
you want to clone.

3. On the repo page, select Clone at upper right.


4. In the Clone repository dialog, select HTTPS for an HTTP connection, or SSH for
an SSH connection, and copy the clone URL under Command line to your
clipboard.

5. On your local machine or DSVM, create the following directories:

For Windows: C:\GitRepos


For Linux: $home/GitRepos

6. Change to the directory you created.

7. In Git Bash, run the command git clone <clone URL> for each repository you want
to clone.

For example, the following command clones the TeamUtilities repository to the
MyTeam directory on your local machine.

HTTPS connection:

Bash

git clone https://DataScienceUnit@dev.azure.com/DataScienceUnit/MyTeam/_git/TeamUtilities

SSH connection:

Bash

git clone git@ssh.dev.azure.com:v3/DataScienceUnit/MyTeam/TeamUtilities


8. Confirm that you can see the folders for the cloned repositories in your local
project directory.

Mount Azure file storage to your DSVM


If your team or project has shared assets in Azure file storage, mount the file storage to
your local machine or DSVM. Follow the instructions at Mount Azure file storage on your
local machine or DSVM.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Here are links to detailed descriptions of the other roles and tasks defined by the Team
Data Science Process:

Group Manager tasks for a data science team


Team Lead tasks for a data science team
Project Lead tasks for a data science team
Agile development of data science
projects
Article • 11/15/2022

This document describes how developers can execute a data science project in a
systematic, version controlled, and collaborative way within a project team by using the
Team Data Science Process (TDSP). The TDSP is a framework developed by Microsoft
that provides a structured sequence of activities to efficiently execute cloud-based,
predictive analytics solutions. For an outline of the roles and tasks that are handled by a
data science team standardizing on the TDSP, see Team Data Science Process roles and
tasks.

This article includes instructions on how to:

Do sprint planning for work items involved in a project.


Add work items to sprints.
Create and use an agile-derived work item template that specifically aligns with
TDSP lifecycle stages.

The following instructions outline the steps needed to set up a TDSP team environment
using Azure Boards and Azure Repos in Azure DevOps. The instructions use Azure
DevOps because that is how to implement TDSP at Microsoft. If your group uses a
different code hosting platform, the team lead tasks generally don't change, but the way
to complete the tasks is different. For example, linking a work item with a Git branch
might not be the same with GitHub as it is with Azure Repos.

The following figure illustrates a typical sprint planning, coding, and source-control
workflow for a data science project:
Work item types
In the TDSP sprint planning framework, there are four frequently used work item types:
Features, User Stories, Tasks, and Bugs. The backlog for all work items is at the project
level, not the Git repository level.

Here are the definitions for the work item types:

Feature: A Feature corresponds to a project engagement. Different engagements with a client are different Features, and it's best to consider different phases of a
project as different Features. If you choose a schema such as <ClientName>-
<EngagementName> to name your Features, you can easily recognize the context
of the project and engagement from the names themselves.

User Story: User Stories are work items needed to complete a Feature end-to-end.
Examples of User Stories include:
Get data
Explore data
Generate features
Build models
Operationalize models
Retrain models

Task: Tasks are assignable work items that need to be done to complete a specific
User Story. For example, Tasks in the User Story Get data could be:
Get SQL Server credentials
Upload data to Azure Synapse Analytics

Bug: Bugs are issues in existing code or documents that must be fixed to complete
a Task. If Bugs are caused by missing work items, they can escalate to be User
Stories or Tasks.

Data scientists may feel more comfortable using an agile template that replaces
Features, User Stories, and Tasks with TDSP lifecycle stages and substages. To create an
agile-derived template that specifically aligns with the TDSP lifecycle stages, see Use an
agile TDSP work template.

Note

TDSP borrows the concepts of Features, User Stories, Tasks, and Bugs from software
code management (SCM). The TDSP concepts might differ slightly from their
conventional SCM definitions.
Plan sprints
Many data scientists are engaged with multiple projects, which can take months to
complete and proceed at different paces. Sprint planning is useful for project
prioritization, and resource planning and allocation. In Azure Boards, you can easily
create, manage, and track work items for your projects, and conduct sprint planning to
ensure projects are moving forward as expected.

For more information about sprint planning, see Scrum sprints.

For more information about sprint planning in Azure Boards, see Assign backlog items
to a sprint.

Add a Feature to the backlog


After your project and project code repository are created, you can add a Feature to the
backlog to represent the work for your project.

1. From your project page, select Boards > Backlogs in the left navigation.

2. On the Backlog tab, if the work item type in the top bar is Stories, drop down and
select Features. Then select New Work Item.

3. Enter a title for the Feature, usually your project name, and then select Add to top.
4. From the Backlog list, select and open the new Feature. Fill in the description,
assign a team member, and set planning parameters.

You can also link the Feature to the project's Azure Repos code repository by
selecting Add link under the Development section.

After you edit the Feature, select Save & Close.

Add a User Story to the Feature


Under the Feature, you can add User Stories to describe major steps needed to
complete the project.

To add a new User Story to a Feature:

1. On the Backlog tab, select the + to the left of the Feature.


2. Give the User Story a title, and edit details such as assignment, status, description,
comments, planning, and priority.

You can also link the User Story to a branch of the project's Azure Repos code
repository by selecting Add link under the Development section. Select the
repository and branch you want to link the work item to, and then select OK.

3. When you're finished editing the User Story, select Save & Close.

Add a Task to a User Story


Tasks are specific detailed steps that are needed to complete each User Story. After all
Tasks of a User Story are completed, the User Story should be completed too.

To add a Task to a User Story, select the + next to the User Story item, and select Task.
Fill in the title and other information in the Task.
After you create Features, User Stories, and Tasks, you can view them in the Backlogs or
Boards views to track their status.

Use an agile TDSP work template


Data scientists may feel more comfortable using an agile template that replaces
Features, User Stories, and Tasks with TDSP lifecycle stages and substages. In Azure
Boards, you can create an agile-derived template that uses TDSP lifecycle stages to
create and track work items. The following steps walk through setting up a data science-
specific agile process template and creating data science work items based on the
template.

Set up an Agile Data Science Process template


1. From your Azure DevOps organization main page, select Organization settings
from the left navigation.

2. In the Organization Settings left navigation, under Boards, select Process.

3. In the All processes pane, select the ... next to Agile, and then select Create
inherited process.

4. In the Create inherited process from Agile dialog, enter the name
AgileDataScienceProcess, and select Create process.
5. In All processes, select the new AgileDataScienceProcess.

6. On the Work item types tab, disable Epic, Feature, User Story, and Task by
selecting the ... next to each item and then selecting Disable.
7. In All processes, select the Backlog levels tab. Under Portfolios backlogs, select
the ... next to Epic (disabled), and then select Edit/Rename.

8. In the Edit backlog level dialog box:


a. Under Name, replace Epic with TDSP Projects.
b. Under Work item types on this backlog level, select New work item type, enter
TDSP Project, and select Add.
c. Under Default work item type, drop down and select TDSP Project.
d. Select Save.

9. Follow the same steps to rename Features to TDSP Stages, and add the following
new work item types:

Business Understanding
Data Acquisition
Modeling
Deployment

10. Under Requirement backlog, rename Stories to TDSP Substages, add the new work
item type TDSP Substage, and set the default work item type to TDSP Substage.

11. Under Iteration backlog, add a new work item type TDSP Task, and set it to be the
default work item type.
After you complete the steps, the backlog levels should look like this:

Create Agile Data Science Process work items


You can use the data science process template to create TDSP projects and track work
items that correspond to TDSP lifecycle stages.

1. From your Azure DevOps organization main page, select New project.

2. In the Create new project dialog, give your project a name, and then select
Advanced.

3. Under Work item process, drop down and select AgileDataScienceProcess, and
then select Create.
4. In the newly created project, select Boards > Backlogs in the left navigation.

5. To make TDSP Projects visible, select the Configure team settings icon. In the
Settings screen, select the TDSP Projects check box, and then select Save and
close.
6. To create a data science-specific TDSP Project, select TDSP Projects in the top bar,
and then select New work item.

7. In the popup, give the TDSP Project work item a name, and select Add to top.

8. To add a work item under the TDSP Project, select the + next to the project, and
then select the type of work item to create.

9. Fill in the details in the new work item, and select Save & Close.
10. Continue to select the + symbols next to work items to add new TDSP Stages,
Substages, and Tasks.

Here is an example of how the data science project work items should appear in
Backlogs view:

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Collaborative coding with Git describes how to do collaborative code development
for data science projects using Git as the shared code development framework,
and how to link these coding activities to the work planned with the agile process.

Additional resources on agile processes:

Agile process
Agile process work item types and workflow

Related resources
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Machine learning at scale
Collaborative coding with Git
Article • 11/15/2022

This article describes how to use Git as the collaborative code development framework
for data science projects. The article covers how to link code in Azure Repos to agile
development work items in Azure Boards, how to do code reviews, and how to create
and merge pull requests for changes.

Link a work item to an Azure Repos branch


Azure DevOps provides a convenient way to connect an Azure Boards User Story or Task
work item with an Azure Repos Git repository branch. You can link your User Story or
Task directly to the code associated with it.

To connect a work item to a new branch, select the Actions ellipsis (...) next to the work
item, and on the context menu, scroll to and select New branch.

In the Create a branch dialog, provide the new branch name and the base Azure Repos
Git repository and branch. The base repository must be in the same Azure DevOps
project as the work item. The base branch can be any existing branch. Select Create
branch.
You can also create a new branch using the following Git bash command in Windows or
Linux:

Bash

git checkout -b <new branch name> <base branch name>

If you don't specify a <base branch name>, the new branch is based on main .

To switch to your working branch, run the following command:

Bash

git checkout <working branch name>

After you switch to the working branch, you can start developing code or
documentation artifacts to complete the work item. Running git checkout main
switches you back to the main branch.

It's a good practice to create a Git branch for each User Story work item. Then, for each
Task work item, you can create a branch based on the User Story branch. Organize the
branches in a hierarchy that corresponds to the User Story-Task relationship when you
have multiple people working on different User Stories for the same project, or on
different Tasks for the same User Story. You can minimize conflicts by having each team
member work on a different branch, or on different code or other artifacts when sharing
a branch.

The following diagram shows the recommended branching strategy for TDSP. You might
not need as many branches as shown here, especially when only one or two people
work on a project, or only one person works on all Tasks of a User Story. But separating
the development branch from the primary branch is always a good practice, and can
help prevent the release branch from being interrupted by development activities. For a
complete description of the Git branch model, see A Successful Git Branching Model.
You can also link a work item to an existing branch. On the Detail page of a work item,
select Add link. Then select an existing branch to link the work item to, and select OK.

Work on the branch and commit changes


After you make a change for your work item, such as adding an R script file to your local
machine's script branch, you can commit the change from your local branch to the
upstream working branch by using the following Git bash commands:

Bash

git status
git add .
git commit -m "added an R script file"
git push origin script
Create a pull request
After one or more commits and pushes, when you're ready to merge your current
working branch into its base branch, you can create and submit a pull request in Azure
Repos.

From the main page of your Azure DevOps project, point to Repos > Pull requests in
the left navigation. Then select either of the New pull request buttons, or the Create a
pull request link.
On the New Pull Request screen, if necessary, navigate to the Git repository and branch
you want to merge your changes into. Add or change any other information you want.
Under Reviewers, add the names of the reviewers, and then select Create.

Review and merge


Once you create the pull request, your reviewers get an email notification to review the
pull request. The reviewers test whether the changes work, and check the changes with
the requester if possible. The reviewers can make comments, request changes, and
approve or reject the pull request based on their assessment.
After the reviewers approve the changes, you or someone else with merge permissions
can merge the working branch to its base branch. Select Complete, and then select
Complete merge in the Complete pull request dialog. You can choose to delete the
working branch after it has merged.
Confirm that the request is marked as COMPLETED.

When you go back to Repos in the left navigation, you can see that you've been
switched to the main branch since the script branch was deleted.
You can also use the following Git bash commands to merge the script working branch
to its base branch and delete the working branch after merging:

Bash

git checkout main
git merge script
git branch -d script

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect


To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Execute data science tasks shows how to use utilities to complete several common data
science tasks, such as interactive data exploration, data analysis, reporting, and model
creation.

Related resources
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Machine learning at scale
Execute data science tasks: exploration,
modeling, and deployment
Article • 11/15/2022

Typical data science tasks include data exploration, modeling, and deployment. This article outlines how to complete several common data science tasks, such as interactive data exploration, data analysis, reporting, and model creation. Options for deploying a model into a production environment may include:

Recommended: Azure Machine Learning


Possible: SQL Server with Machine Learning Services

1. Exploration
A data scientist can perform exploration and reporting in a variety of ways: by using
libraries and packages available for Python (matplotlib for example) or with R (ggplot or
lattice for example). Data scientists can customize such code to fit the needs of data
exploration for specific scenarios. The needs for dealing with structured data are
different from those for unstructured data such as text or images.

Products such as Azure Machine Learning also provide advanced data preparation for
data wrangling and exploration, including feature creation. The user should decide on
the tools, libraries, and packages that best suit their needs.
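
For example, a quick first look at a dataset might take only a few lines of pandas and matplotlib. The following is a minimal sketch, not code from the original article; the file name adult.csv and the output file name are placeholders for your own data.

Python

# A quick exploratory look at a dataset with pandas and matplotlib.
# "adult.csv" and the output file name are placeholders for your own data.
import matplotlib
matplotlib.use("Agg")  # render to files so the sketch also works without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("adult.csv")

print(df.shape)            # number of rows and columns
print(df.dtypes)           # column types
print(df.describe())       # summary statistics for numeric columns
print(df.isnull().mean())  # fraction of missing values per column

df.hist(figsize=(10, 8))   # histograms of all numeric columns
plt.tight_layout()
plt.savefig("exploration_histograms.png")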

The deliverable at the end of this phase is a data exploration report. The report should
provide a fairly comprehensive view of the data to be used for modeling and an
assessment of whether the data is suitable to proceed to the modeling step.

2. Modeling
There are numerous toolkits and packages for training models in a variety of languages.
Data scientists should feel free to use whichever ones they are comfortable with, as
long as performance considerations regarding accuracy and latency are satisfied for the
relevant business use cases and production scenarios.

Model management
After multiple models have been built, you usually need a system for registering and managing them. Typically, you need a combination of scripts or APIs and a backend database or versioning system. Azure Machine Learning supports deployment of both ONNX models and MLflow models.
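
As one possible illustration (not a prescription from this article), the MLflow tracking API can provide the scripts-plus-backend combination for recording models before they're registered and deployed. The dataset, model, and experiment name below are placeholders.

Python

# Train a simple scikit-learn model and record it with MLflow so it can be
# versioned and later deployed (for example, through Azure Machine Learning's
# MLflow model support). The data, model, and names are only placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

mlflow.set_experiment("tdsp-example")
with mlflow.start_run():
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, artifact_path="model")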

3. Deployment
Production deployment enables a model to play an active role in a business. Predictions
from a deployed model can be used for business decisions.

Production platforms
There are various approaches and platforms to put models into production. We
recommend deployment to Azure Machine Learning.

Note

Before deployment, make sure that the latency of model scoring is low enough to use in production.

A/B testing
When multiple models are in production, it can be useful to perform A/B testing to
compare performance of the models.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Track progress of data science projects shows how a data scientist can track the
progress of a data science project.
Model operation and CI/CD shows how CI/CD can be performed with developed
models.
Test data science code with Azure
DevOps
Article • 11/15/2022

This article gives preliminary guidelines for testing code in a data science workflow,
using Azure DevOps. Such testing gives data scientists a systematic and efficient way to
check the quality and expected outcome of their code. We use a Team Data Science
Process (TDSP) project that uses the UCI Adult Income dataset that we published
earlier to show how code testing can be done.

Introduction to code testing


"Unit testing" is a longstanding practice for software development. But for data science,
it's often not clear what "unit testing" means and how you should test code for different
stages of a data science lifecycle, such as:

Data preparation
Data quality examination
Modeling
Model deployment

This article replaces the term "unit testing" with "code testing." It refers to testing as the
functions that help to assess if code for a certain step of a data science lifecycle is
producing results "as expected." The person who's writing the test defines what's "as
expected," depending on the outcome of the function--for example, data quality check
or modeling.

This article provides references as useful resources.

Azure DevOps for the testing framework


This article describes how to perform and automate testing by using Azure DevOps. You
might decide to use alternative tools. We also show how to set up an automatic build by
using Azure DevOps and build agents. For build agents, we use Azure Data Science
Virtual Machines (DSVMs).

Flow of code testing


The overall workflow of testing code in a data science project looks like this:
Detailed steps
Use the following steps to set up and run code testing and an automated build by using
a build agent and Azure DevOps:

1. Create a project in the Visual Studio desktop application:

After you create your project, you'll find it in Solution Explorer in the right pane:
2. Feed your project code into the Azure DevOps project code repository:
3. Suppose you've done some data preparation work, such as data ingestion, feature engineering, and creating label columns. You want to make sure your code is generating the results that you expect. Here are some checks you can use to test whether the data-processing code is working properly; a sketch of these checks follows the list:

Check that column names are right:

Check that response levels are right:

Check that response percentage is reasonable:

Check the missing rate of each column in the data:
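
The original test code isn't reproduced here, but a minimal sketch of these checks might look like the following, assuming the prepared data is a pandas DataFrame. The income column name, label levels, and thresholds are illustrative values for the UCI Adult Income scenario; adjust them to your own data.

Python

# Sketch of data-processing checks for a prepared pandas DataFrame (df).
# Column name, label levels, and thresholds are illustrative placeholders.

def check_column_names(df, expected_columns):
    """Check that the column names are right."""
    assert list(df.columns) == list(expected_columns)


def check_response_levels(df, label_col="income", expected_levels=("<=50K", ">50K")):
    """Check that the response (label) levels are right."""
    assert set(df[label_col].unique()) == set(expected_levels)


def check_response_percentage(df, label_col="income", level=">50K", low=0.1, high=0.5):
    """Check that the response percentage is reasonable."""
    share = (df[label_col] == level).mean()
    assert low <= share <= high


def check_missing_rate(df, max_missing_rate=0.05):
    """Check the missing rate of each column in the data."""
    assert (df.isnull().mean() <= max_missing_rate).all()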

4. After you've done the data processing and feature engineering work, and you've
trained a good model, make sure that the model you trained can score new
datasets correctly. You can use the following two tests to check the prediction
levels and distribution of label values:
Check prediction levels:

Check the distribution of prediction values:
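
Similarly, the scoring checks could be sketched as follows, assuming predictions is a list or array of predicted labels; again, the level names and thresholds are placeholders. Functions like these, together with the data checks above, are what step 5 collects into test_funcs.py.

Python

# Sketch of checks on model predictions for a newly scored dataset.
import numpy as np


def check_prediction_levels(predictions, expected_levels=("<=50K", ">50K")):
    """Check that predictions contain only the expected label levels."""
    assert set(np.unique(predictions)) <= set(expected_levels)


def check_prediction_distribution(predictions, level=">50K", low=0.1, high=0.5):
    """Check that the distribution of predicted values is reasonable."""
    share = np.mean(np.asarray(predictions) == level)
    assert low <= share <= high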

5. Put all test functions together into a Python script called test_funcs.py:
6. After the test code is prepared, you can set up the testing environment in Visual Studio.

Create a Python file called test1.py. In this file, create a class that includes all the
tests you want to do. The following example shows six tests prepared:
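
The six-test example isn't reproduced here. A minimal sketch of such a class, assuming the check functions above live in test_funcs.py and using the standard unittest.TestCase base class (your project's base class, such as the codetest.testCase mentioned in the next step, may differ), could look like this:

Python

# Sketch of test1.py: wraps the individual checks in a test class so the test
# runner in Visual Studio or an Azure DevOps build agent can discover them.
import unittest

import pandas as pd

from test_funcs import (check_column_names, check_missing_rate,
                        check_prediction_distribution, check_prediction_levels,
                        check_response_levels, check_response_percentage)


class DataScienceCodeTests(unittest.TestCase):
    def setUp(self):
        # Tiny in-memory stand-ins for the prepared data and model output;
        # a real test would load the processed dataset and score a held-out set.
        self.df = pd.DataFrame({
            "age": [39, 50, 38, 53],
            "education": ["Bachelors", "Bachelors", "HS-grad", "11th"],
            "income": ["<=50K", ">50K", "<=50K", ">50K"],
        })
        self.predictions = ["<=50K", ">50K", "<=50K", "<=50K"]

    def test_column_names(self):
        check_column_names(self.df, ["age", "education", "income"])

    def test_response_levels(self):
        check_response_levels(self.df)

    def test_response_percentage(self):
        check_response_percentage(self.df)

    def test_missing_rate(self):
        check_missing_rate(self.df)

    def test_prediction_levels(self):
        check_prediction_levels(self.predictions)

    def test_prediction_distribution(self):
        check_prediction_distribution(self.predictions)


if __name__ == "__main__":
    unittest.main()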
1. Those tests can be automatically discovered if you put codetest.testCase after your
class name. Open Test Explorer in the right pane, and select Run All. All the tests
will run sequentially and will tell you if the test is successful or not.

2. Check in your code to the project repository by using Git commands. Your most
recent work will be reflected shortly in Azure DevOps.
3. Set up automatic build and test in Azure DevOps:

a. In the project repository, select Build and Release, and then select +New to
create a new build process.
b. Follow the prompts to select your source code location, project name,
repository, and branch information.

c. Select a template. Because there's no Python project template, start by selecting Empty process.

d. Name the build and select the agent. You can choose the default here if you
want to use a DSVM to complete the build process. For more information about
setting agents, see Build and release agents.
e. Select + in the left pane, to add a task for this build phase. Because we're going
to run the Python script test1.py to complete all the checks, this task is using a
PowerShell command to run Python code.

f. In the PowerShell details, fill in the required information, such as the name and
version of PowerShell. Choose Inline Script as the type.

In the box under Inline Script, you can type python test1.py. Make sure the
environment variable is set up correctly for Python. If you need a different version
or kernel of Python, you can explicitly specify the path as shown in the figure:
g. Select Save & queue to complete the build pipeline process.

Now every time a new commit is pushed to the code repository, the build process will
start automatically. You can define any branch. The process runs the test1.py file in the
agent machine to make sure that everything defined in the code runs correctly.

If alerts are set up correctly, you'll be notified in email when the build is finished. You
can also check the build status in Azure DevOps. If it fails, you can check the details of
the build and find out which piece is broken.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
See the UCI income prediction repository for concrete examples of unit tests for
data science scenarios.
Follow the preceding outline and examples from the UCI income prediction
scenario in your own data science projects.

References
Team Data Science Process
Visual Studio Testing Tools
Azure DevOps Testing Resources
Data Science Virtual Machines
Track the progress of data science
projects
Article • 08/31/2023

Data science group managers, team leads, and project leads can track the progress of
their projects. Managers want to know what work has been done, who did the work, and
what work remains. Managing expectations is an important element of success.

Azure DevOps dashboards


If you're using Azure DevOps, you can build dashboards to track the activities and work
items associated with a given Agile project. For more information about dashboards, see
Dashboards, reports, and widgets.

For instructions on how to create and customize dashboards and widgets in Azure
DevOps, see the following quickstarts:

Add and manage dashboards


Add widgets to a dashboard

Example dashboard
Here is a simple example dashboard that tracks the sprint activities of an Agile data
science project, including the number of commits to associated repositories.

The countdown tile shows the number of days that remain in the current sprint.

The two code tiles show the number of commits in the two project repositories for
the past seven days.

Work items for TDSP Customer Project shows the results of a query for all work
items and their status.

A cumulative flow diagram (CFD) shows the number of Closed and Active work
items.

The burndown chart shows work still to complete against remaining time in the
sprint.

The burnup chart shows completed work compared to total amount of work in the
sprint.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Create CI/CD pipelines for AI apps
using Azure Pipelines, Docker, and
Kubernetes
Azure • Azure Pipelines • Azure Machine Learning • Azure Kubernetes Service (AKS)

An Artificial Intelligence (AI) application is application code embedded with a pretrained
machine learning (ML) model. There are always two streams of work for an AI
application: Data scientists build the ML model, and app developers build the app and
expose it to end users to consume. This article describes how to implement a
continuous integration and continuous delivery (CI/CD) pipeline for an AI application
that embeds the ML model into the app source code. The sample code and tutorial use
a Python Flask web application, and fetch a pretrained model from a private Azure blob
storage account. You could also use an AWS S3 storage account.

Note

The following process is one of several ways to do CI/CD. There are alternatives to
this tooling and the prerequisites.

Source code, tutorial, and prerequisites


You can download source code and a detailed tutorial from GitHub. Follow the
tutorial steps to implement a CI/CD pipeline for your own application.

To use the downloaded source code and tutorial, you need the following prerequisites:

The source code repository forked to your GitHub account


An Azure DevOps Organization
Azure CLI
An Azure Container Service for Kubernetes (AKS) cluster
Kubectl to run commands and fetch configuration from the AKS cluster
An Azure Container Registry (ACR) account

CI/CD pipeline summary


Each new Git commit kicks off the Build pipeline. The build securely pulls the latest ML
model from a blob storage account and packages it with the app code in a single container.
This decoupling of the application development and data science workstreams ensures that
the production app is always running the latest code with the latest ML model. If the app
passes testing, the pipeline securely stores the built Docker container image in ACR. The
release pipeline then deploys the container using AKS.

CI/CD pipeline steps


The following diagram and steps describe the CI/CD pipeline architecture:

1. Developers work on the application code in the IDE of their choice.


2. The developers commit the code to Azure Repos, GitHub, or other Git source
control provider.
3. Separately, data scientists work on developing their ML model.
4. The data scientists publish the finished model to a model repository, in this case a
blob storage account.
5. Azure Pipelines kicks off a build based on the Git commit.
6. The Build pipeline pulls the latest ML model from blob storage and creates a
container.
7. The pipeline pushes the build image to the private image repository in ACR.
8. The Release pipeline kicks off based on the successful build.
9. The pipeline pulls the latest image from ACR and deploys it across the Kubernetes
cluster on AKS.
10. User requests for the app go through the DNS server.
11. The DNS server passes the requests to a load balancer, and sends responses back
to the users.
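Step 6 is the point where the build pulls the model artifact before the Docker image is
created. As a rough illustration of that step only (not the tutorial's actual code; the
container name, blob path, output location, and the AZURE_STORAGE_CONNECTION_STRING secret
variable are assumptions), a Python script run by the Build pipeline could fetch the model
with the azure-storage-blob package:

Python

import os

from azure.storage.blob import BlobServiceClient

# The connection string is injected as a secret pipeline variable, never hard-coded.
conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]

service = BlobServiceClient.from_connection_string(conn_str)
blob = service.get_blob_client(container="models", blob="sentiment/model.pkl")

# Download the model next to the Flask app code so it is baked into the container image.
os.makedirs("app/model", exist_ok=True)
with open("app/model/model.pkl", "wb") as f:
    f.write(blob.download_blob().readall())

print("Model downloaded and ready to be packaged into the image.")

In Azure Pipelines, a script like this would typically run before the Docker build task,
with the connection string supplied as a secret variable rather than committed to source
control.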
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Team Data Science Process (TDSP)
Azure Machine Learning (AML)
Azure DevOps
Azure Kubernetes Services (AKS)

Related resources
What are Azure Machine Learning pipelines?
Compare Microsoft machine learning products and technologies
Machine learning operations (MLOps) v2
Team Data Science Process for data
scientists
Article • 10/09/2023

This article provides guidance on a set of objectives that are typically used to implement
comprehensive data science solutions with Azure technologies. You are guided through:

Understanding an analytics workload


Using the Team Data Science Process
Using Azure Machine Learning
Understanding the foundations of data transfer and storage
Providing data source documentation
Using tools for analytics processing

These training materials are related to the Team Data Science Process (TDSP) and to the
Microsoft and open-source software and toolkits that are helpful for envisioning,
executing, and delivering data science solutions.

Lesson Path
You can use the items in the following list to guide your own self-study. Read the
Description entries to follow the path, use the Topic links for study references, and
check your skills using the Knowledge Check items.

Objective: Understand the processes for developing analytic projects

Topic: An introduction to the Team Data Science Process
Description: We begin by covering an overview of the Team Data Science Process (TDSP).
This process guides you through each step of an analytics project. Read through each of
these sections to learn more about the process and how you can implement it.
Knowledge Check: Review and download the TDSP Project Structure artifacts to your local
machine for your project.

Topic: Agile Development
Description: The Team Data Science Process works well with many different programming
methodologies. In this Learning Path, we use Agile software development. Read through the
"What is Agile Development?" and "Building Agile Culture" articles, which cover the basics
of working with Agile. There are also other references at this site where you can learn
more.
Knowledge Check: Explain Continuous Integration and Continuous Delivery to a colleague.

Topic: DevOps for Data Science
Description: Developer Operations (DevOps) involves people, processes, and platforms you
can use to work through a project and integrate your solution into an organization's
standard IT. This integration is essential for adoption, safety, and security. In this
online course, you learn about DevOps practices as well as understand some of the
toolchain options you have.
Knowledge Check: Prepare a 30-minute presentation to a technical audience on how DevOps is
essential for analytics projects.

Objective: Understand the Technologies for Data Storage and Processing

Topic: Microsoft Business Analytics and AI
Description: We focus on a few technologies in this Learning Path that you can use to
create an analytics solution, but Microsoft has many more. To understand the options you
have, it's important to review the platforms and features available in Microsoft Azure,
the Azure Stack, and on-premises options. Review this resource to learn the various tools
you have available to answer analytics questions.
Knowledge Check: Download and review the presentation materials from this workshop.

Objective: Setup and Configure your training, development, and production environments

Topic: Microsoft Azure
Description: Now let's create an account in Microsoft Azure for training and learn how to
create development and test environments. These free training resources get you started.
Complete the "Beginner" and "Intermediate" paths.
Knowledge Check: If you do not have an Azure Account, create one. Log in to the Microsoft
Azure portal and create one Resource Group for training.

Topic: The Microsoft Azure Command-Line Interface (CLI)
Description: There are multiple ways of working with Microsoft Azure, from graphical tools
like VSCode and Visual Studio, to Web interfaces such as the Azure portal, and from the
command line, such as Azure PowerShell commands and functions. In this article, we cover
the Command-Line Interface (CLI), which you can use locally on your workstation, in
Windows and other Operating Systems, as well as in the Azure portal.
Knowledge Check: Set your default subscription with the Azure CLI.

Topic: Microsoft Azure Storage
Description: You need a place to store your data. In this article, you learn about
Microsoft Azure's storage options, how to create a storage account, and how to copy or
move data to the cloud. Read through this introduction to learn more.
Knowledge Check: Create a Storage Account in your training Resource Group, create a
container for a Blob object, and upload and download data.

Topic: Microsoft Entra ID
Description: Microsoft Entra ID forms the basis of securing your application. In this
article, you learn more about accounts, rights, and permissions. Active Directory and
security are complex topics, so just read through this resource to understand the
fundamentals.
Knowledge Check: Add one user to Microsoft Entra ID. NOTE: You may not have permissions
for this action if you are not the administrator for the subscription. If that's the case,
simply review this tutorial to learn more.

Topic: The Microsoft Azure Data Science Virtual Machine
Description: You can install the tools for working with Data Science locally on multiple
operating systems. But the Microsoft Azure Data Science Virtual Machine (DSVM) contains
all of the tools you need and plenty of project samples to work with. In this article, you
learn more about the DSVM and how to work through its examples. This resource explains the
Data Science Virtual Machine, how you can create one, and a few options for developing
code with it. It also contains all the software you need to complete this learning path,
so make sure you complete the Knowledge Path for this topic.
Knowledge Check: Create a Data Science Virtual Machine and work through at least one lab.

Objective: Install and Understand the tools and technologies for working with Data Science
solutions

Topic: Working with git
Description: To follow our DevOps process with the TDSP, we need to have a version-control
system. Microsoft Azure Machine Learning uses git, a popular open-source distributed
repository system. In this article, you learn more about how to install, configure, and
work with git and a central repository, GitHub.
Knowledge Check: Clone this GitHub project for your learning path project structure.

Topic: VSCode
Description: VSCode is a cross-platform Integrated Development Environment (IDE) that you
can use with multiple languages and Azure tools. You can use this single environment to
create your entire solution. Watch these introductory videos to get started.
Knowledge Check: Install VSCode, and work through the VS Code features in the Interactive
Editor Playground.

Topic: Programming with Python
Description: In this solution we use Python, one of the most popular languages in Data
Science. This article covers the basics of writing analytic code with Python, and
resources to learn more. Work through sections 1-9 of this reference, then check your
knowledge.
Knowledge Check: Add one entity to an Azure Table using Python.

Topic: Working with Notebooks
Description: Notebooks are a way of introducing text and code in the same document. Azure
Machine Learning works with Notebooks, so it is beneficial to understand how to use them.
Read through this tutorial and give it a try in the Knowledge Check section.
Knowledge Check: Open this page, and click on the "Welcome to Python.ipynb" link. Work
through the examples on that page.

Topic: Machine Learning
Description: Creating advanced Analytic solutions involves working with data, using
Machine Learning, which also forms the basis of working with Artificial Intelligence and
Deep Learning. This course teaches you more about Machine Learning. For a comprehensive
course on Data Science, check out this certification.
Knowledge Check: Locate a resource on Machine Learning Algorithms. (Hint: Search on "azure
machine learning algorithm cheat sheet")

Topic: scikit-learn
Description: The scikit-learn set of tools allows you to perform data science tasks in
Python. We use this framework in our solution. This article covers the basics and explains
where you can learn more.
Knowledge Check: Using the Iris dataset, persist an SVM model using Pickle.

Topic: Working with Docker
Description: Docker is a distributed platform used to build, ship, and run applications,
and is used frequently in Azure Machine Learning. This article covers the basics of this
technology and explains where you can go to learn more.
Knowledge Check: Open Visual Studio Code, and install the Docker Extension. Create a
simple Node Docker container.

Topic: HDInsight
Description: HDInsight is the Hadoop open-source infrastructure, available as a service in
Microsoft Azure. Your Machine Learning algorithms may involve large sets of data, and
HDInsight has the ability to store, transfer, and process data at large scale. This
article covers working with HDInsight.
Knowledge Check: Create a small HDInsight cluster. Use HiveQL statements to project
columns onto an /example/data/sample.log file. Alternatively, you can complete this
knowledge check on your local system.

Objective: Create a Data Processing Flow from Business Requirements

Topic: Determining the Question, following the TDSP
Description: With the development environment installed and configured, and the
understanding of the technologies and processes in place, it's time to put everything
together using the TDSP to perform an analysis. We need to start by defining the question,
selecting the data sources, and the rest of the steps in the Team Data Science Process.
Keep in mind the DevOps process as we work through this process. In this article, you
learn how to take the requirements from your organization and create a data flow map
through your application to define your solution using the Team Data Science Process.
Knowledge Check: Locate a resource on "The 5 data science questions" and describe one
question your organization might have in these areas. Which algorithms should you focus on
for that question?

Objective: Use Azure Machine Learning to create a predictive solution

Topic: Azure Machine Learning
Description: Microsoft Azure Machine Learning uses AI for data wrangling and feature
engineering, manages experiments, and tracks model runs. All of this works in a single
environment and most functions can run locally or in Azure. You can use the PyTorch,
TensorFlow, and other frameworks to create your experiments. In this article, we focus on
a complete example of this process, using everything you've learned so far.

Objective: Use Power BI to visualize results

Topic: Power BI
Description: Power BI is Microsoft's data visualization tool. It is available on multiple
platforms from Web to mobile devices and desktop computers. In this article, you learn how
to work with the output of the solution you've created by accessing the results from Azure
storage and creating visualizations using Power BI.
Knowledge Check: Complete this tutorial on Power BI. Then connect Power BI to the Blob CSV
created in an experiment run.

Objective: Monitor your Solution

Topic: Application Insights
Description: There are multiple tools you can use to monitor your end solution. Azure
Application Insights makes it easy to integrate built-in monitoring into your solution.
Knowledge Check: Set up Application Insights to monitor an Application.

Topic: Azure Monitor logs
Description: Another method to monitor your application is to integrate it into your
DevOps process. The Azure Monitor logs system provides a rich set of features to help you
watch your analytic solutions after you deploy them.
Knowledge Check: Complete this tutorial on using Azure Monitor logs.

Objective: Complete this Learning Path
Description: Congratulations! You've completed this learning path.
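For example, the scikit-learn knowledge check above (persist an SVM trained on the Iris
dataset with Pickle) can be completed with a few lines of Python. This is a minimal
sketch, not part of the official learning path materials:

Python

import pickle

from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Train a simple SVM classifier on the Iris dataset.
X, y = load_iris(return_X_y=True)
model = SVC(kernel="rbf", gamma="scale").fit(X, y)

# Persist the trained model with pickle, then reload it and make predictions.
with open("iris_svm.pkl", "wb") as f:
    pickle.dump(model, f)

with open("iris_svm.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.predict(X[:3]))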

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
See Team Data Science Process for Developer Operations. This article explores the
Developer Operations (DevOps) functions that are specific to an Advanced Analytics and
Cognitive Services solution implementation.

Related resources
Execute data science tasks: exploration, modeling, and deployment
Set up data science environments for use in the Team Data Science Process
Platforms and tools for data science projects
Data science and machine learning with Azure Databricks
What is the Team Data Science Process?
Team Data Science Process for
Developer Operations
Article • 06/21/2023

This article explores the Developer Operations (DevOps) functions that are specific to an
Advanced Analytics and Cognitive Services solution implementation. These training
materials use the Team Data Science Process (TDSP) together with Microsoft and open-source
software and toolkits that are helpful for envisioning, executing, and delivering data
science solutions. They reference topics that cover the DevOps toolchain that is specific
to data science and AI projects and solutions.

Lesson Path
The following list provides level-based guidance to help complete the DevOps objectives
for implementing data science solutions on Azure.

Objective: Understand the Advanced Analytics Lifecycle

Topic: The Team Data Science Process
This technical walkthrough describes the Team Data Science Process. Technologies: Data
Science. Level: Intermediate. Prerequisites: general technology background; familiarity
with data solutions, IT projects, and solution implementation.

Objective: Understand the Microsoft Azure Platform for Advanced Analytics

Topic: Information Management
This reference gives an overview of Azure Data Factory, which you can use to build
pipelines for analytics data solutions. Technologies: Microsoft Azure Data Factory. Level:
Experienced. Prerequisites: general technology background; familiarity with data
solutions, IT projects, and solution implementation.

This reference covers an overview of the Azure Data Catalog, which you can use to document
and manage metadata on your data sources. Technologies: Microsoft Azure Data Catalog.
Level: Intermediate. Prerequisites: general technology background; familiarity with data
solutions, Relational Database Management Systems (RDBMS), and NoSQL data sources.

This reference covers an overview of the Azure Event Hubs system and how you can use it to
ingest data into your solution. Technologies: Azure Event Hubs. Level: Intermediate.
Prerequisites: general technology background; familiarity with data solutions, RDBMS and
NoSQL data sources, and Internet of Things (IoT) terminology and use.

Topic: Big Data Stores
This reference covers an overview of using Azure Synapse Analytics to store and process
large amounts of data. Technologies: Azure Synapse Analytics. Level: Experienced.
Prerequisites: general technology background; familiarity with data solutions, RDBMS and
NoSQL data sources, and HDFS terminology and use.

This reference covers an overview of using Azure Data Lake to capture data of any size,
type, and ingestion speed in one single place for operational and exploratory analytics.
Technologies: Azure Data Lake Store. Level: Intermediate. Prerequisites: general
technology background; familiarity with data solutions, NoSQL data sources, and HDFS.

Topic: Machine learning and analytics
This reference covers an introduction to machine learning, predictive analytics, and
Artificial Intelligence systems. Technologies: Azure Machine Learning. Level:
Intermediate. Prerequisites: general technology background; familiarity with data
solutions, Data Science terms, and Machine Learning and artificial intelligence terms.

This article provides an introduction to Azure HDInsight, a cloud distribution of the
Hadoop technology stack. It also covers what a Hadoop cluster is and when you would use
it. Technologies: Azure HDInsight. Level: Intermediate. Prerequisites: general technology
background; familiarity with data solutions and NoSQL data sources.

This reference covers an overview of the Azure Data Lake Analytics job service.
Technologies: Azure Data Lake Analytics. Level: Intermediate. Prerequisites: general
technology background; familiarity with data solutions and NoSQL data sources.

This overview covers using Azure Stream Analytics as a fully managed event-processing
engine to set up real-time analytic computations on streaming data. Technologies: Azure
Stream Analytics. Level: Intermediate. Prerequisites: general technology background;
familiarity with data solutions and with structured and unstructured data concepts.

Topic: Intelligence
This reference covers an overview of the available Cognitive Services (such as vision,
text, and search) and how to get started using them. Technologies: Cognitive Services.
Level: Experienced. Prerequisites: general technology background; familiarity with data
solutions; software development.

This reference covers an introduction to the Microsoft Bot Framework and how to get
started using it. Technologies: Bot Framework. Level: Experienced. Prerequisites: general
technology background; familiarity with data solutions.

Topic: Visualization
This self-paced, online course covers the Power BI system and how to create and publish
reports. Technologies: Microsoft Power BI. Level: Beginner. Prerequisites: general
technology background; familiarity with data solutions.

Topic: Solutions
This resource page covers multiple applications you can review, test, and implement to see
a complete solution from start to finish. Technologies: Microsoft Azure, Azure Machine
Learning, Cognitive Services, Microsoft R, Azure Cognitive Search, Python, Azure Data
Factory, Power BI, Azure Document DB, Application Insights, Azure SQL DB, Azure Synapse
Analytics, Microsoft SQL Server, Azure Data Lake, Cognitive Services, Bot Framework, Azure
Batch. Level: Intermediate. Prerequisites: general technology background; familiarity with
data solutions.

Objective: Understand and Implement DevOps processes

Topic: What is DevOps?
This article explains the fundamentals of DevOps and helps explain how they map to DevOps
practices. Technologies: DevOps, Microsoft Azure Platform, Azure DevOps. Level:
Intermediate. Prerequisites: familiarity with Agile and other development frameworks; IT
operations familiarity.

Objective: Use the DevOps Toolchain for Data Science

Topic: Configure
This reference covers the basics of choosing the proper visualization in Visio to
communicate your project design. Technologies: Visio. Level: Intermediate. Prerequisites:
general technology background; familiarity with data solutions.

This reference describes Azure Resource Manager and its terms, and serves as the primary
root source for samples, getting started, and other references. Technologies: Azure
Resource Manager, Azure PowerShell, Azure CLI. Level: Intermediate. Prerequisites: general
technology background; familiarity with data solutions.

This reference explains the Azure Data Science Virtual Machines for Linux and Windows.
Technologies: Data Science Virtual Machine. Level: Experienced. Prerequisites: familiarity
with Data Science workloads; Linux.

This walkthrough explains configuring Azure cloud service roles with Visual Studio; pay
close attention to the connection strings, specifically for storage accounts.
Technologies: Visual Studio. Level: Intermediate. Prerequisites: software development.

This series teaches you how to use Microsoft Project to schedule time, resources, and
goals for an Advanced Analytics project. Technologies: Microsoft Project. Level:
Intermediate. Prerequisites: understand project management fundamentals.

This Microsoft Project template provides time, resources, and goals tracking for an
Advanced Analytics project. Technologies: Microsoft Project. Level: Intermediate.
Prerequisites: understand project management fundamentals.

This Azure Data Catalog tutorial describes a system of registration and discovery for
enterprise data assets. Technologies: Azure Data Catalog. Level: Beginner. Prerequisites:
familiarity with data sources and structures.

This Microsoft Virtual Academy course explains how to set up Dev-Test with Visual Studio
Codespace and Microsoft Azure. Technologies: Visual Studio Codespace. Level: Experienced.
Prerequisites: software development; familiarity with Dev/Test environments.

This Management Pack download for Microsoft System Center contains a Guidelines Document
to assist in working with Azure assets. Technologies: System Center. Level: Intermediate.
Prerequisites: experience with System Center for IT management.

This document is intended for developer and operations teams to understand the benefits of
PowerShell Desired State Configuration. Technologies: PowerShell DSC. Level: Intermediate.
Prerequisites: experience with PowerShell coding, enterprise architectures, and scripting.

Topic: Code
This download also contains documentation on using Visual Studio Codespace for creating
Data Science and AI applications. Technologies: Visual Studio Codespace. Level:
Intermediate. Prerequisites: software development.

This getting started site teaches you about DevOps and Visual Studio. Technologies: Visual
Studio. Level: Beginner. Prerequisites: software development.

You can write code directly from the Azure portal using the App Service Editor. Learn more
at this resource about Continuous Integration with this tool. Technologies: Azure portal.
Level: Highly Experienced. Prerequisites: Data Science background, but read this anyway.

This resource explains how to get started with Azure Machine Learning. Technologies: Azure
Machine Learning. Level: Intermediate. Prerequisites: software development.

This reference contains a list and a study link to all of the development tools on the
Data Science Virtual Machine in Azure. Technologies: Data Science Virtual Machine. Level:
Experienced. Prerequisites: software development; Data Science.

Read and understand each of the references in this Azure Security Trust Center for
Security, Privacy, and Compliance; this is very important. Technologies: Azure Security.
Level: Intermediate. Prerequisites: system architecture experience; security development
experience.

Topic: Build
This course teaches you about enabling DevOps practices with Visual Studio Codespace
Build. Technologies: Visual Studio Codespace. Level: Experienced. Prerequisites: software
development; familiarity with an SDLC.

This reference explains compiling and building using Visual Studio. Technologies: Visual
Studio. Level: Intermediate. Prerequisites: software development; familiarity with an
SDLC.

This reference explains how to orchestrate processes such as software builds with
Runbooks. Technologies: System Center. Level: Experienced. Prerequisites: experience with
System Center Orchestrator.

Topic: Test
Use this reference to understand how to use Visual Studio Codespace for Test Case
Management. Technologies: Visual Studio Codespace. Level: Experienced. Prerequisites:
software development; familiarity with an SDLC.

Use this previous reference for Runbooks to automate tests using System Center.
Technologies: System Center. Level: Experienced. Prerequisites: experience with System
Center Orchestrator.

As part of not only testing but development, you should build in security. The Microsoft
SDL Threat Modeling Tool can help in all phases. Learn more and download it here.
Technologies: Threat Monitoring Tool. Level: Experienced. Prerequisites: familiarity with
security concepts; software development.

This article explains how to use the Microsoft Attack Surface Analyzer to test your
Advanced Analytics solution. Technologies: Attack Surface Analyzer. Level: Experienced.
Prerequisites: familiarity with security concepts; software development.

Topic: Package
This reference explains the concepts of working with packages in TFS and Visual Studio
Codespace. Technologies: Visual Studio Codespace. Level: Experienced. Prerequisites:
software development; familiarity with an SDLC.

Use this previous reference for Runbooks to automate packaging using System Center.
Technologies: System Center. Level: Experienced. Prerequisites: experience with System
Center Orchestrator.

This reference explains how to create a data pipeline for your solution, which you can
save as a JSON template as a "package". Technologies: Azure Data Factory. Level:
Intermediate. Prerequisites: general computing background; data project experience.

This topic describes the structure of an Azure Resource Manager template. Technologies:
Azure Resource Manager. Level: Intermediate. Prerequisites: familiarity with the Microsoft
Azure Platform.

DSC is a management platform in PowerShell that enables you to manage your IT and
development infrastructure with configuration as code, saved as a package. This reference
is an overview for that topic. Technologies: PowerShell Desired State Configuration.
Level: Intermediate. Prerequisites: PowerShell coding; familiarity with enterprise
architectures; scripting.

Topic: Release
This head-reference article contains concepts for build, test, and release for CI/CD
environments. Technologies: Visual Studio Codespace. Level: Experienced. Prerequisites:
software development; familiarity with CI/CD environments; familiarity with an SDLC.

Use this previous reference for Runbooks to automate release management using System
Center. Technologies: System Center. Level: Experienced. Prerequisites: experience with
System Center Orchestrator.

This article helps you determine the best option to deploy the files for your web app,
mobile app backend, or API app to Azure App Service, and then guides you to appropriate
resources with instructions specific to your preferred option. Technologies: Microsoft
Azure Deployment. Level: Intermediate. Prerequisites: software development; experience
with the Microsoft Azure platform.

Topic: Monitor
This reference explains Application Insights and how you can add it to your Advanced
Analytics solutions. Technologies: Application Insights. Level: Intermediate.
Prerequisites: software development; familiarity with the Microsoft Azure platform.

This topic explains basic concepts about Operations Manager for the administrator who
manages the Operations Manager infrastructure and the operator who monitors and supports
the Advanced Analytics solution. Technologies: System Center. Level: Experienced.
Prerequisites: familiarity with enterprise monitoring and System Center Operations
Manager.

This blog entry explains how to use Azure Data Factory to monitor and manage the Advanced
Analytics pipeline. Technologies: Azure Data Factory. Level: Intermediate. Prerequisites:
familiarity with Azure Data Factory.

Objective: Understand how to use Open Source Tools with DevOps on Azure

Topic: Open Source DevOps Tools and Azure
This reference page contains two videos and a whitepaper on using Chef with Azure
deployments. Technologies: Chef. Level: Experienced. Prerequisites: familiarity with the
Azure Platform; familiarity with DevOps.

This site has a toolchain selection path. Technologies: DevOps, Microsoft Azure Platform,
Azure DevOps, Open Source Software. Level: Experienced. Prerequisites: used an SDLC;
familiarity with Agile and other development frameworks; IT operations familiarity.

This tutorial automates the build and test phase of application development using a
continuous integration and deployment (CI/CD) pipeline. Technologies: Jenkins. Level:
Experienced. Prerequisites: familiarity with the Azure Platform, DevOps, and Jenkins.

This contains an overview of working with Docker and Azure, as well as additional
references for implementation for Data Science applications. Technologies: Docker. Level:
Intermediate. Prerequisites: familiarity with the Azure Platform and with server operating
systems.

This installation and explanation explains how to use Visual Studio Code with Azure
assets. Technologies: VSCode. Level: Intermediate. Prerequisites: software development;
familiarity with the Microsoft Azure Platform.

This blog entry explains how to use R Studio with Microsoft R. Technologies: R Studio.
Level: Intermediate. Prerequisites: R language experience.

This blog entry shows how to use continuous integration with Azure and GitHub.
Technologies: Git, GitHub. Level: Intermediate. Prerequisites: software development.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
See Team Data Science Process for data scientists. This article provides guidance for
implementing data science solutions with Azure.

Related resources
Execute data science tasks: exploration, modeling, and deployment
Set up data science environments for use in the Team Data Science Process
Platforms and tools for data science projects
Data science and machine learning with Azure Databricks
What is the Team Data Science Process?
Set up data science environments for
use in the Team Data Science Process
Article • 11/15/2022

The Team Data Science Process uses various data science environments for the storage,
processing, and analysis of data. They include Azure Blob Storage, several types of Azure
virtual machines, HDInsight (Hadoop) clusters, and Machine Learning workspaces. The
decision about which environment to use depends on the type and quantity of data to
be modeled and the target destination for that data in the cloud.

See Quickstart: Create workspace resources you need to get started with Azure
Machine Learning.

The Microsoft Data Science Virtual Machine (DSVM) is also available as an Azure virtual
machine (VM) image. This VM is pre-installed and configured with several popular tools
that are commonly used for data analytics and machine learning. The DSVM is available
on both Windows and Linux. For more information, see Introduction to the cloud-based
Data Science Virtual Machine for Linux and Windows.

Learn how to create:

Windows DSVM
Ubuntu DSVM
CentOS DSVM

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.


Platforms and tools for data science
projects
Article • 05/30/2023

Microsoft provides a full spectrum of analytics resources for both cloud and on-premises
platforms. They can be deployed to make the execution of your data science projects
efficient and scalable. Guidance for teams implementing data science projects in a
trackable, version controlled, and collaborative way is provided by the Team Data
Science Process (TDSP). See Team Data Science Process roles and tasks, for an outline of
the personnel roles, and their associated tasks that are handled by a data science team
standardizing on this process.

The main recommended Azure resource for TDSP is Azure Machine Learning. Examples
in Azure Architecture Center sometimes show Azure Machine Learning used with other
Azure resources. These other analytics resources available to data science teams using
the TDSP include:

Data Science Virtual Machines (both Windows and Linux CentOS)


HDInsight Spark Clusters
Azure Synapse Analytics
Azure Data Lake
HDInsight Hive Clusters
Azure File Storage
SQL Server 2019 R and Python Services
Azure Databricks

In this document, we briefly describe the resources and provide links to the tutorials and
walkthroughs the TDSP teams have published. The articles show you how to use these
resources step by step to build your intelligent applications. More information on these
resources is available on their product pages.

Data Science Virtual Machine (DSVM)


The Data Science Virtual Machine, offered by Microsoft on both Windows and Linux, contains
popular tools for data science modeling and development activities. It includes tools
such as:

Microsoft R Server Developer Edition


Anaconda Python distribution
Jupyter notebooks for Python and R
Visual Studio Community Edition with Python and R Tools on Windows / Eclipse on
Linux
Power BI desktop for Windows
SQL Server 2016 Developer Edition on Windows / Postgres on Linux

It also includes ML and AI tools like xgboost, mxnet, and Vowpal Wabbit.

Currently DSVM is available in Windows and Linux CentOS operating systems. Choose
the size of your DSVM (number of CPU cores and the amount of memory) based on the
needs of the data science projects that you plan to execute on it.

For more information on Windows edition of DSVM, see Microsoft Data Science Virtual
Machine on the Azure Marketplace. For the Linux edition of the DSVM, see Linux Data
Science Virtual Machine .

To learn how to execute some of the common data science tasks on the DSVM
efficiently, see 10 things you can do on the Data science Virtual Machine

Azure HDInsight Spark clusters


Apache Spark is an open-source parallel processing framework that supports in-memory
processing to boost the performance of big-data analytic applications. The Spark
processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-
memory computation capabilities make it a good choice for iterative algorithms in
machine learning and for graph computations. Spark is also compatible with Azure Blob
storage (WASB), so your existing data stored in Azure can easily be processed using
Spark.

When you create a Spark cluster in HDInsight, you create Azure compute resources with
Spark installed and configured. It takes about 10 minutes to create a Spark cluster in
HDInsight. Store the data to be processed in Azure Blob storage. For information on
using Azure Blob Storage with a cluster, see Use HDFS-compatible Azure Blob storage
with Hadoop in HDInsight.

TDSP team from Microsoft has published two end-to-end walkthroughs on how to use
Azure HDInsight Spark Clusters to build data science solutions, one using Python and
the other Scala. For more information on Azure HDInsight Spark Clusters, see Overview:
Apache Spark on HDInsight Linux. To learn how to build a data science solution using
Python on an Azure HDInsight Spark Cluster, see Overview of Data Science using Spark
on Azure HDInsight. To learn how to build a data science solution using Scala on an
Azure HDInsight Spark Cluster, see Data Science using Scala and Spark on Azure.
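For example, once a Spark cluster is running, data kept in Azure Blob Storage can be read
directly by its wasbs:// path. The following PySpark sketch is illustrative only; the
storage account, container, and file names are assumptions:

Python

from pyspark.sql import SparkSession

# On an HDInsight Spark cluster, a SparkSession is usually preconfigured as `spark`;
# creating one explicitly also works when you submit a standalone script.
spark = SparkSession.builder.appName("blob-read-example").getOrCreate()

# wasbs:// paths address the HDFS-compatible Azure Blob storage attached to the cluster.
path = "wasbs://mycontainer@mystorageaccount.blob.core.windows.net/data/trips.csv"

df = spark.read.csv(path, header=True, inferSchema=True)
df.printSchema()
print(df.count())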
Azure Synapse Analytics
Azure Synapse Analytics allows you to scale compute resources easily and in seconds,
without over-provisioning or over-paying. It also offers the unique option to pause the
use of compute resources, giving you the freedom to better manage your cloud costs.
The ability to deploy scalable compute resources makes it possible to bring all your data
into Azure Synapse Analytics. Storage costs are minimal and you can run compute only
on the parts of datasets that you want to analyze.

For more information on Azure Synapse Analytics, see the Azure Synapse Analytics
website. To learn how to build end-to-end advanced analytics solutions with Azure
Synapse Analytics, see The Team Data Science Process in action: using Azure Synapse
Analytics.

Azure Data Lake


Azure Data Lake is an enterprise-wide repository of every type of data, collected in a
single location prior to any formal requirements or schema being imposed. This
flexibility allows every type of data to be kept in a data lake, regardless of its size or
structure or how fast it is ingested. Organizations can then use Hadoop or advanced
analytics to find patterns in these data lakes. Data lakes can also serve as a repository for
lower-cost data preparation before curating the data and moving it into a data
warehouse.

For more information on Azure Data Lake, see Introducing Azure Data Lake . To learn
how to build a scalable end-to-end data science solution with Azure Data Lake, see
Scalable Data Science in Azure Data Lake: An end-to-end Walkthrough

Azure HDInsight Hive (Hadoop) clusters


Apache Hive is a data warehouse system for Hadoop, which enables data
summarization, querying, and the analysis of data using HiveQL, a query language
similar to SQL. Hive can be used to interactively explore your data or to create reusable
batch processing jobs.

Hive allows you to project structure on largely unstructured data. After you define the
structure, you can use Hive to query that data in a Hadoop cluster without having to
use, or even know, Java or MapReduce. HiveQL (the Hive query language) allows you to
write queries with statements that are similar to T-SQL.
For data scientists, Hive can run Python User-Defined Functions (UDFs) in Hive queries
to process records. This ability extends the capability of Hive queries in data analysis
considerably. Specifically, it allows data scientists to conduct scalable feature
engineering in languages they're mostly familiar with: the SQL-like HiveQL and Python.

For more information on Azure HDInsight Hive Clusters, see Use Hive and HiveQL with
Hadoop in HDInsight. To learn how to build a scalable end-to-end data science solution
with Azure HDInsight Hive Clusters, see The Team Data Science Process in action: using
HDInsight Hadoop clusters.
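As a rough illustration of that pattern, a Python UDF used through Hive's streaming
interface is simply a script that reads tab-delimited rows from standard input and writes
transformed rows to standard output; it is invoked from HiveQL with a TRANSFORM ... USING
clause. The column layout below is hypothetical:

Python

import sys

# Hypothetical streaming UDF: emit each row plus an engineered fare-per-mile column.
for line in sys.stdin:
    fields = line.strip().split("\t")  # Hive streams rows as tab-separated values
    if len(fields) < 2:
        continue
    trip_distance, total_fare = fields[0], fields[1]
    try:
        fare_per_mile = float(total_fare) / float(trip_distance)
    except (ValueError, ZeroDivisionError):
        fare_per_mile = 0.0
    # Write the original columns plus the new feature, again tab-separated.
    print("\t".join(fields + ["{:.4f}".format(fare_per_mile)]))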

Azure File Storage


Azure File Storage is a service that offers file shares in the cloud using the standard
Server Message Block (SMB) Protocol. Both SMB 2.1 and SMB 3.0 are supported. With
Azure File storage, you can migrate legacy applications that rely on file shares to Azure
quickly and without costly rewrites. Applications running in Azure virtual machines or
cloud services or from on-premises clients can mount a file share in the cloud, just as a
desktop application mounts a typical SMB share. Any number of application
components can then mount and access the File storage share simultaneously.

Especially useful for data science projects is the ability to create an Azure file store as
the place to share project data with your project team members. Each of them then has
access to the same copy of the data in the Azure file storage. They can also use this file
storage to share feature sets generated during the execution of the project. If the
project is a client engagement, your clients can create an Azure file storage under their
own Azure subscription to share the project data and features with you. In this way, the
client has full control of the project data assets. For more information on Azure File
Storage, see Get started with Azure File storage on Windows and How to use Azure File
Storage with Linux.

SQL Server 2019 R and Python Services


R Services (In-database) provides a platform for developing and deploying intelligent
applications that can uncover new insights. You can use the rich and powerful R
language, including the many packages provided by the R community, to create models
and generate predictions from your SQL Server data. Because R Services (In-database)
integrates the R language with SQL Server, analytics are kept close to the data, which
eliminates the costs and security risks associated with moving data.

R Services (In-database) supports the open source R language with a comprehensive set
of SQL Server tools and technologies. They offer superior performance, security,
reliability, and manageability. You can deploy R solutions using convenient and familiar
tools. Your production applications can call the R runtime and retrieve predictions and
visuals using Transact-SQL. You also use the ScaleR libraries to improve the scale and
performance of your R solutions. For more information, see SQL Server R Services.

The TDSP team from Microsoft has published two end-to-end walkthroughs that show
how to build data science solutions in SQL Server 2016 R Services: one for R
programmers and one for SQL developers. For R Programmers, see Data Science End-
to-End Walkthrough. For SQL Developers, see In-Database Advanced Analytics for SQL
Developers (Tutorial).

Appendix: Tools to set up data science projects

Install Git Credential Manager on Windows


If you're following the TDSP on Windows, you need to install the Git Credential
Manager (GCM) to communicate with the Git repositories. To install GCM, you first need
to install Chocolatey. To install Chocolatey and the GCM, run the following commands in
Windows PowerShell as an Administrator:

PowerShell

iwr https://chocolatey.org/install.ps1 -UseBasicParsing | iex


choco install git-credential-manager-for-windows -y

Install Git on Linux (CentOS) machines


Run the following bash command to install Git on Linux (CentOS) machines:

Bash

sudo yum install git

Generate public SSH key on Linux (CentOS) machines


If you're using Linux (CentOS) machines to run the git commands, you need to add the
public SSH key of your machine to your Azure DevOps services. This way the machine is
recognized by the Azure DevOps Services. First, you need to generate a public SSH key
and add the key to SSH public keys in your Azure DevOps services security setting page.

1. To generate the SSH key, run the following two commands:


ssh-keygen
cat .ssh/id_rsa.pub

2. Copy the entire ssh key including ssh-rsa.

3. Log in to your Azure DevOps Services.

4. Click <Your Name> at the top-right corner of the page and click security.

5. Click SSH public keys, and click +Add.

6. Paste the ssh key copied into the text box and save.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Understand data science for machine learning
Machine learning at scale
Introduction to Azure Machine Learning
Databricks Data Science & Engineering

Related resources
What is the Team Data Science Process?
Team Data Science Process roles and tasks
Compare the machine learning products and technologies from Microsoft
Identify scenarios and plan for
advanced analytics data processing
Article • 01/06/2023

What resources are required for you to create an environment that can perform
advanced analytics processing on a dataset? This article suggests a series of questions
to ask that can help identify tasks and resources relevant to your scenario.

To learn about the order of high-level steps for predictive analytics, see What is the
Team Data Science Process (TDSP). Each step requires specific resources for the tasks
relevant to your particular scenario.

Answer key questions in the following areas to identify your scenario:

data logistics
data characteristics
dataset quality
preferred tools and languages

Logistic questions: data locations and movement
The logistic questions cover the following items:

data source location


target destination in Azure
requirements for moving the data, including the schedule, amount, and resources
involved

You may need to move the data several times during the analytics process. A common
scenario is to move local data into some form of storage on Azure and then into Azure
Machine Learning.

What is your data source?


Is your data local or in the cloud? Possible locations include:

a publicly available HTTP address


a local or network file location
a SQL Server database
an Azure Storage container

What is the Azure destination?


Where does your data need to be for processing or modeling?

Azure Machine Learning


Azure Blob Storage
SQL Azure databases
SQL Server on Azure VM
HDInsight (Hadoop on Azure) or Hive tables
Mountable Azure virtual hard disks

How are you going to move the data?


For procedures and resources to ingest or load data into a variety of different storage
and processing environments, see:

Load data into storage environments for analytics


Secure data access in Azure Machine Learning

Does the data need to be moved on a regular schedule or modified during migration?
Consider using Azure Data Factory (ADF) when data needs to be continually migrated.
ADF can be helpful for:

a hybrid scenario that involves both on-premises and cloud resources


a scenario where the data is transacted, modified, or changed by business logic in
the course of being migrated

For more information, see Move data from a SQL Server database to SQL Azure with
Azure Data Factory.

How much of the data is to be moved to Azure?


Large datasets may exceed the storage capacity of certain compute clusters. In such
cases, you might use a sample of the data during the analysis. For details of how to
down-sample a dataset in various Azure environments, see Sample data in the Team
Data Science Process.
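For example, if a raw extract is too large to move or model in full, a simple random
down-sample can be taken locally before upload. This is a generic pandas sketch, not a
TDSP-specific utility; the file names and the 10 percent sampling fraction are
assumptions:

Python

import pandas as pd

# Read the full local extract, then keep a reproducible 10 percent random sample.
df = pd.read_csv("trips_full.csv")
sample = df.sample(frac=0.10, random_state=42)

sample.to_csv("trips_sample.csv", index=False)
print("Kept {} of {} rows".format(len(sample), len(df)))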
Data characteristics questions: type, format,
and size
These questions are key to planning your storage and processing environments. They
will help you choose the appropriate scenario for your data type and understand any
restrictions.

What are the data types?


Numerical
Categorical
Strings
Binary

How is your data formatted?


Comma-separated (CSV) or tab-separated (TSV) flat files
Compressed or uncompressed
Azure blobs
Hadoop Hive tables
SQL Server tables

How large is your data?


Small: Less than 2 GB
Medium: Greater than 2 GB and less than 10 GB
Large: Greater than 10 GB

As applied to Azure Machine Learning:

Data ingestion options for Azure Machine Learning workflows.


Optimize data processing with Azure Machine Learning.

Data quality questions: exploration and pre-processing

What do you know about your data?


Understand the basic characteristics about your data:
What patterns or trends it exhibits
What outliers it has
How many values are missing

This step is important to help you:

Determine how much pre-processing is needed


Formulate hypotheses that suggest the most appropriate features or type of
analysis
Formulate plans for additional data collection

Useful techniques for data inspection include descriptive statistics calculation and
visualization plots. For details of how to explore a dataset in various Azure
environments, see Explore data in the Team Data Science Process.
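For instance, a quick first pass with pandas surfaces the basic characteristics listed
above: summary statistics, missing values, and candidate outliers. This is a minimal
sketch that assumes a local CSV file with a numeric fare_amount column:

Python

import pandas as pd

df = pd.read_csv("trips_sample.csv")

# Summary statistics and data types give a first look at patterns, trends, and ranges.
print(df.describe(include="all"))
print(df.dtypes)

# Count missing values per column to gauge how much pre-processing is needed.
print(df.isna().sum())

# Flag simple outliers: fares more than three standard deviations from the mean.
fare = df["fare_amount"]
outliers = df[(fare - fare.mean()).abs() > 3 * fare.std()]
print("{} potential fare outliers".format(len(outliers)))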

Does the data require preprocessing or cleaning?


You might need to preprocess and clean your data before you can use the dataset
effectively for machine learning. Raw data is often noisy and unreliable. It might be
missing values. Using such data for modeling can produce misleading results. For a
description, see Tasks to prepare data for enhanced machine learning.

Tools and languages questions


There are many options for languages, development environments, and tools. Be aware
of your needs and preferences.

What languages do you prefer to use for analysis?


R
Python
SQL
Other

What tools could you use for data analysis?


Azure Machine Learning uses Jupyter notebooks for data analysis. In addition to this
recommended environment, here are other options often paired in intermediate to
advanced enterprise scenarios.
Microsoft Azure PowerShell - a scripting language used to administer your Azure
resources
RStudio
Python Tools for Visual Studio
Microsoft Power BI

Identify your advanced analytics scenario


After you have answered the questions in the previous section, you are ready to
determine which scenario best fits your case. The sample scenarios are outlined in
Scenarios for advanced analytics in Azure Machine Learning.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
What is the Team Data Science Process (TDSP)?

Related resources
Execute data science tasks: exploration, modeling, and deployment
Set up data science environments for use in the Team Data Science Process
Platforms and tools for data science projects
Data science and machine learning with Azure Databricks
Load data into storage environments for
analytics
Article • 11/15/2022

The Team Data Science Process requires that data be ingested or loaded into the most
appropriate way in each stage. Data destinations can include Azure Blob Storage, SQL
Azure databases, SQL Server on Azure VM, HDInsight (Hadoop), Azure Synapse
Analytics, and Azure Machine Learning.

The following articles describe how to ingest data into various target environments
where the data is stored and processed.

To/From Azure Blob Storage


To SQL Server on Azure VM
To Azure SQL Database
To Hive tables
To SQL partitioned tables
From On-premises SQL Server

Technical and business needs, as well as the initial location, format, and size of your
data, determine the best data ingestion plan. It is not uncommon for the best plan to
involve several steps. This sequence of tasks can include, for example, data exploration, pre-
processing, cleaning, down-sampling, and model training. Azure Data Factory is a
recommended Azure resource to orchestrate data movement and transformation.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.


Move data to and from Azure Blob
storage
Article • 01/06/2023

The Team Data Science Process requires that data be ingested or loaded into a variety of
different storage environments to be processed or analyzed in the most appropriate
way in each stage of the process. Azure Blob Storage has comprehensive documentation at
this link, but this section of the TDSP documentation provides a summary to get you
started.

Different technologies for moving data


The following articles describe how to move data to and from Azure Blob storage using
different technologies.

Azure Storage Explorer


AzCopy
Python
SSIS

Which method is best for you depends on your scenario. The Scenarios for advanced
analytics in Azure Machine Learning article helps you determine the resources you need
for a variety of data science workflows used in the advanced analytics process.
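If you choose the Python option, the azure-storage-blob package covers the basic upload
and download operations. The following sketch is illustrative only; the connection string
placeholder, container name, blob names, and local file paths are assumptions, and the
authentication method (account key, SAS token, or Microsoft Entra ID) depends on your
setup:

Python

from azure.storage.blob import BlobServiceClient

# Connect by using a connection string copied from the storage account's access keys.
service = BlobServiceClient.from_connection_string("<your-connection-string>")
container = service.get_container_client("tdsp-data")

# Upload a local file as a block blob, overwriting any existing blob with the same name.
with open("trips_sample.csv", "rb") as data:
    container.upload_blob(name="raw/trips_sample.csv", data=data, overwrite=True)

# Download the same blob back to a local file.
blob = container.get_blob_client("raw/trips_sample.csv")
with open("trips_sample_copy.csv", "wb") as f:
    f.write(blob.download_blob().readall())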

Note

For a complete introduction to Azure blob storage, refer to Azure Blob Basics and
to Azure Blob Service.

Using Azure Data Factory


As an alternative, you can use Azure Data Factory to do the following:

Create and schedule a pipeline that downloads data from Azure Blob storage.
Pass it to a published Azure Machine Learning web service.
Receive the predictive analytics results.
Upload the results to storage.

For more information, see Create predictive pipelines using Azure Data Factory and
Azure Machine Learning.
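The second step of that flow, calling a published Azure Machine Learning web service, is
an ordinary authenticated REST call. The scoring URL, API key, and input payload below are
placeholders; the actual values and schema come from the deployed service:

Python

import json

import requests

scoring_uri = "https://<your-scoring-endpoint>/score"  # placeholder
api_key = "<your-api-key>"  # placeholder; keep it in a secret store, not in code

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer {}".format(api_key),
}

# The input schema depends entirely on how the model was deployed.
payload = {"data": [[0.2, 1.4, 3.1]]}

response = requests.post(scoring_uri, headers=headers, data=json.dumps(payload), timeout=30)
response.raise_for_status()
print(response.json())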
Prerequisites
This article assumes that you have an Azure subscription, a storage account, and the
corresponding storage key for that account. Before uploading/downloading data, you
must know your Azure Storage account name and account key.

To set up an Azure subscription, see Free one-month trial .


For instructions on creating a storage account and for getting account and key
information, see About Azure Storage accounts.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Introduction to Azure Blob Storage
Copy and move blobs from one container or storage account to another
What is the Team Data Science Process (TDSP)?

Related resources
Explore data in Azure Blob storage
Process Azure Blob Storage data with advanced analytics
Set up data science environments for use in the Team Data Science Process
Load data into storage environments for analytics
Move data to and from Azure Blob
Storage using Azure Storage Explorer
Article • 01/06/2023

Azure Storage Explorer is a free tool from Microsoft that allows you to work with Azure
Storage data on Windows, macOS, and Linux. This topic describes how to use it to
upload and download data from Azure Blob Storage. The tool can be downloaded from
Microsoft Azure Storage Explorer .

This menu links to technologies you can use to move data to and from Azure Blob
storage:

Note

If you are using a VM that was set up with the scripts provided by the Data Science
Virtual Machine in Azure, then Azure Storage Explorer is already installed on the VM.

Note

For a complete introduction to Azure Blob Storage, refer to Azure Blob Basics and
Azure Blob Service REST API.

Prerequisites
This document assumes that you have an Azure subscription, a storage account, and the
corresponding storage key for that account. Before uploading/downloading data, you
must know your Azure Storage account name and account key.

To set up an Azure subscription, see Free one-month trial .


For instructions on creating a storage account and for getting account and key
information, see About Azure Storage accounts. Make a note the access key for
your storage account as you need this key to connect to the account with the
Azure Storage Explorer tool.
The Azure Storage Explorer tool can be downloaded from Microsoft Azure Storage
Explorer . Accept the defaults during install.
Use Azure Storage Explorer
The following steps document how to upload/download data using Azure Storage
Explorer.

1. Launch Microsoft Azure Storage Explorer.


2. To bring up the Sign in to your account... wizard, select the Azure account settings
icon, then Add an account, and enter your credentials.

3. To bring up the Connect to Azure Storage wizard, select the Connect to Azure
Storage icon.

4. Enter the access key from your Azure Storage account on the Connect to Azure
Storage wizard and then Next.
5. Enter storage account name in the Account name box and then select Next.

6. The storage account you added should now be displayed. To create a blob container in
a storage account, right-click the Blob Containers node in that account, select
Create Blob Container, and enter a name.
7. To upload data to a container, select the target container and click the Upload
button.

8. Click on the ... to the right of the Files box, select one or multiple files to upload
from the file system and click Upload to begin uploading the files.
9. To download data, select the blob to download in the corresponding container, and then
click Download.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:
Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Introduction to Azure Blob Storage
Upload, download, and manage data with Azure Storage Explorer
What is the Team Data Science Process (TDSP)?

Related resources
Explore data in Azure Blob storage
Process Azure Blob Storage data with advanced analytics
Set up data science environments for use in the Team Data Science Process
Load data into storage environments for analytics
Move data to or from Azure Blob
Storage using SSIS connectors
Article • 01/06/2023

The Azure Feature Pack for Integration Services (SSIS) provides components to connect
to Azure, transfer data between Azure and on-premises data sources, and process data
stored in Azure.

This menu links to technologies you can use to move data to and from Azure Blob
storage:

Once customers have moved on-premises data into the cloud, they can access their data
from any Azure service to leverage the full power of the suite of Azure technologies. The
data may be subsequently used, for example, in Azure Machine Learning or on an
HDInsight cluster.

Examples for using these Azure resources are in the SQL and HDInsight walkthroughs.

For a discussion of canonical scenarios that use SSIS to accomplish business needs
common in hybrid data integration scenarios, see Doing more with SQL Server
Integration Services Feature Pack for Azure blog.

Note

For a complete introduction to Azure blob storage, refer to Azure Blob Basics and
to Azure Blob Service REST API.

Prerequisites
To perform the tasks described in this article, you must have an Azure subscription and
an Azure Storage account set up. You need the Azure Storage account name and
account key to upload or download data.

To set up an Azure subscription, see Free one-month trial .


For instructions on creating a storage account and for getting account and key
information, see About Azure Storage accounts.

To use the SSIS connectors, you must download:


SQL Server 2014 or 2016 Standard (or above): Install includes SQL Server
Integration Services.
Microsoft SQL Server 2014 or 2016 Integration Services Feature Pack for Azure:
These connectors can be downloaded, respectively, from the SQL Server 2014
Integration Services and SQL Server 2016 Integration Services pages.

Note

SSIS is installed with SQL Server, but is not included in the Express version. For
information on what applications are included in various editions of SQL Server, see
SQL Server Technical Documentation

For installing SSIS, see Install Integration Services (SSIS)

For information on how to get up and running using SSIS to build simple extraction,
transformation, and load (ETL) packages, see SSIS Tutorial: Creating a Simple ETL
Package.

Download NYC Taxi dataset


The example described here uses a publicly available dataset, available either through
Azure Open Datasets or from the source TLC Trip Record Data. The dataset consists of
about 173 million taxi rides in NYC in the year 2013. There are two types of data: trip
details data and fare data.

Upload data to Azure blob storage


To move data from on-premises to Azure blob storage using the SSIS feature pack, we
use an instance of the Azure Blob Upload Task.

The parameters that the task uses are described here:

Field Description

AzureStorageConnection Specifies an existing Azure Storage Connection Manager or
creates a new one that refers to an Azure Storage account that points to where the
blob files are hosted.

BlobContainer Specifies the name of the blob container that holds the uploaded files
as blobs.

BlobDirectory Specifies the blob directory where the uploaded file is stored as a
block blob. The blob directory is a virtual hierarchical structure. If the blob
already exists, it is replaced.

LocalDirectory Specifies the local directory that contains the files to be uploaded.

FileName Specifies a name filter to select files with the specified name pattern. For
example, MySheet*.xls* includes files such as MySheet001.xls and MySheetABC.xlsx.

TimeRangeFrom/TimeRangeTo Specifies a time range filter. Files modified after
TimeRangeFrom and before TimeRangeTo are included.

Note

The AzureStorageConnection credentials need to be correct and the
BlobContainer must exist before the transfer is attempted.

Download data from Azure blob storage


To download data from Azure blob storage to on-premises storage with SSIS, use an
instance of the Azure Blob Download Task.

More advanced SSIS-Azure scenarios


The SSIS feature pack allows for more complex flows to be handled by packaging tasks
together. For example, the blob data could feed directly into an HDInsight cluster,
whose output could be downloaded back to a blob and then to on-premises storage.
SSIS can run Hive and Pig jobs on an HDInsight cluster using additional SSIS connectors:

To run a Hive script on an Azure HDInsight cluster with SSIS, use Azure HDInsight
Hive Task.
To run a Pig script on an Azure HDInsight cluster with SSIS, use Azure HDInsight
Pig Task.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Introduction to Azure Blob Storage
Copy and move blobs from one container or storage account to another
Execute existing SSIS packages in Azure Data Factory or Azure Synapse Pipeline
What is the Team Data Science Process (TDSP)?

Related resources
Explore data in Azure Blob storage
Process Azure Blob Storage data with advanced analytics
Move data to and from Azure Blob Storage using Azure Storage Explorer
Load data into storage environments for analytics
Move data to SQL Server on an Azure
virtual machine
Article • 01/06/2023

This article outlines the options for moving data either from flat files (CSV or TSV
formats) or from an on-premises SQL Server to SQL Server on an Azure virtual machine.
These tasks for moving data to the cloud are part of the Team Data Science Process.

For a topic that outlines the options for moving data to an Azure SQL Database for
Machine Learning, see Move data to an Azure SQL Database for Azure Machine
Learning.

The following table summarizes the options for moving data to SQL Server on an Azure
virtual machine.

SOURCE DESTINATION: SQL Server on Azure VM

Flat File 1. Command-line bulk copy utility (BCP)
2. Bulk Insert SQL Query
3. Graphical Built-in Utilities in SQL Server

On-Premises SQL Server 1. Deploy a SQL Server Database to a Microsoft Azure VM wizard
2. Export to a flat file
3. SQL Server Migration Assistant (SSMA)
4. Database back up and restore

This document assumes that SQL commands are executed from SQL Server
Management Studio or Visual Studio Database Explorer.

Tip

As an alternative, you can use Azure Data Factory to create and schedule a
pipeline that will move data to a SQL Server VM on Azure. For more information,
see Copy data with Azure Data Factory (Copy Activity).

Prerequisites
This tutorial assumes you have:

An Azure subscription. If you do not have a subscription, you can sign up for a
free trial .
An Azure storage account. You will use an Azure storage account for storing the
data in this tutorial. If you don't have an Azure storage account, see the Create a
storage account article. After you have created the storage account, you will need
to obtain the account key used to access the storage. See Manage storage account
access keys.
Provisioned SQL Server on an Azure VM. For instructions, see Set up an Azure
virtual machine for SQL Server as an IPython Notebook server for advanced
analytics.
Installed and configured Azure PowerShell locally. For instructions, see How to
install and configure Azure PowerShell.

Moving data from a flat file source to SQL Server on an Azure VM
If your data is in a flat file (arranged in a row/column format), it can be moved to SQL
Server VM on Azure via the following methods:

1. Command-line bulk copy utility (BCP)


2. Bulk Insert SQL Query
3. Graphical Built-in Utilities in SQL Server (Import/Export, SSIS)

Command-line bulk copy utility (BCP)


BCP is a command-line utility installed with SQL Server and is one of the quickest ways
to move data. It works across all three SQL Server variants (On-premises SQL Server, SQL
Azure, and SQL Server VM on Azure).

Note

Where should my data be for BCP? While it is not required, having files containing
source data located on the same machine as the target SQL Server allows for faster
transfers (network speed vs local disk IO speed). You can move the flat files
containing data to the machine where SQL Server is installed using various file
copying tools, such as AzCopy, Azure Storage Explorer, or Windows copy/paste
via Remote Desktop Protocol (RDP).

1. Ensure that the database and the tables are created on the target SQL Server
database. Here is an example of how to do that using the Create Database and
Create Table commands:
SQL

CREATE DATABASE <database_name>

CREATE TABLE <tablename>
(
    <columnname1> <datatype> <constraint>,
    <columnname2> <datatype> <constraint>,
    <columnname3> <datatype> <constraint>
)

2. Generate the format file that describes the schema for the table by issuing the
following command from the command line of the machine where bcp is installed.

   bcp dbname..tablename format nul -c -x -f exportformatfilename.xml -S servername\sqlinstance -T -t \t -r \n

3. Insert the data into the database using the bcp command, which should work from
the command line when SQL Server is installed on the same machine:

   bcp dbname..tablename in datafilename.tsv -f exportformatfilename.xml -S servername\sqlinstancename -U username -P password -b block_size_to_move_in_single_attempt -t \t -r \n

Optimizing BCP inserts: To optimize such inserts, refer to the article 'Guidelines for
Optimizing Bulk Import'.

Parallelizing Inserts for Faster Data Movement


If the data you are moving is large, you can speed things up by executing multiple BCP
commands in parallel in a PowerShell script.

Note

Big data ingestion: To optimize data loading for large and very large datasets,
partition your logical and physical database tables using multiple file groups and
partition tables. For more information about creating and loading data to partition
tables, see Parallel Load SQL Partition Tables.

The following sample PowerShell script demonstrates parallel inserts using bcp:

PowerShell

$NO_OF_PARALLEL_JOBS=2

Set-ExecutionPolicy RemoteSigned #set execution policy for the script to execute

# Define what each job does
$ScriptBlock = {
    param($partitionnumber)

    #Explicitly using SQL username password
    bcp database..tablename in datafile_path.csv -F 2 -f format_file_path.xml -U username@servername -S tcp:servername -P password -b block_size_to_move_in_single_attempt -t "," -r \n -o path_to_outputfile.$partitionnumber.txt

    #Trusted connection w.o username password (if you are using windows auth and are signed in with that credentials)
    #bcp database..tablename in datafile_path.csv -o path_to_outputfile.$partitionnumber.txt -h "TABLOCK" -F 2 -f format_file_path.xml -T -b block_size_to_move_in_single_attempt -t "," -r \n
}

# Background processing of all partitions
for ($i=1; $i -le $NO_OF_PARALLEL_JOBS; $i++)
{
    Write-Debug "Submit loading partition # $i"
    Start-Job $ScriptBlock -Arg $i
}

# Wait for it all to complete
While (Get-Job -State "Running")
{
    Start-Sleep 10
    Get-Job
}

# Getting the information back from the jobs
Get-Job | Receive-Job
Set-ExecutionPolicy Restricted #reset the execution policy

Bulk Insert SQL Query


The Bulk Insert SQL query can be used to import data into the database from row/column-
based files (the supported types are covered in the Prepare Data for Bulk Export or
Import (SQL Server) topic).

Here are some sample commands for Bulk Insert:

1. Analyze your data and set any custom options before importing to make sure that
the SQL Server database assumes the same format for any special fields such as
dates. Here is an example of how to set the date format as year-month-day (if your
data contains the date in year-month-day format):

SQL

SET DATEFORMAT ymd;

2. Import data using bulk import statements:

SQL

BULK INSERT <tablename>
FROM '<datafilename>'
WITH
(
    FIRSTROW = 2,
    FIELDTERMINATOR = ',', --this should be column separator in your data
    ROWTERMINATOR = '\n'   --this should be the row separator in your data
)

Built-in Utilities in SQL Server


You can use SQL Server Integration Services (SSIS) to import data into SQL Server VM on
Azure from a flat file. SSIS is available in two studio environments. For details, see
Integration Services (SSIS) and Studio Environments:

For details on SQL Server Data Tools, see Microsoft SQL Server Data Tools
For details on the Import/Export Wizard, see SQL Server Import and Export Wizard

Moving Data from on-premises SQL Server to SQL Server on an Azure VM
You can also use the following migration strategies:

1. Deploy a SQL Server Database to a Microsoft Azure VM wizard


2. Export to Flat File
3. SQL Server Migration Assistant (SSMA)
4. Database back up and restore

We describe each of these options below:


Deploy a SQL Server Database to a Microsoft Azure VM
wizard
The Deploy a SQL Server Database to a Microsoft Azure VM wizard is a simple and
recommended way to move data from an on-premises SQL Server instance to SQL
Server on an Azure VM. For detailed steps as well as a discussion of other alternatives,
see Migrate a database to SQL Server on an Azure VM.

Export to Flat File


Various methods can be used to bulk export data from an on-premises SQL Server as
documented in the Bulk Import and Export of Data (SQL Server) topic. This document
will cover the Bulk Copy Program (BCP) as an example. Once data is exported into a flat
file, it can be imported to another SQL server using bulk import.

1. Export the data from the on-premises SQL Server instance to a file using the bcp
utility as follows:

   bcp dbname..tablename out datafile.tsv -S servername\sqlinstancename -T -t \t -r \n -c

2. Create the database and the table on the SQL Server VM on Azure using the create
database and create table commands for the table schema exported in step 1.

3. Create a format file for describing the table schema of the data being
exported/imported. Details of the format file are described in Create a Format File
(SQL Server).

Format file generation when running BCP from the SQL Server computer

   bcp dbname..tablename format nul -c -x -f exportformatfilename.xml -S servername\sqlinstance -T -t \t -r \n

Format file generation when running BCP remotely against a SQL Server

   bcp dbname..tablename format nul -c -x -f exportformatfilename.xml -U username@servername.database.windows.net -S tcp:servername -P password -t \t -r \n

4. Use any of the methods described in the section Moving data from a flat file source
to move the data in flat files to SQL Server.

SQL Server Migration Assistant (SSMA)


SQL Server Migration Assistant (SSMA) provides a user-friendly way to move data
between two SQL Server instances. It allows the user to map the data schema between
source and destination tables, choose column types, and use various other features. It
uses bulk copy (BCP) under the covers.

Database back up and restore


SQL Server supports:

1. Database back up and restore functionality (either to a local file or as a bacpac
export to a blob) and Data Tier Applications (using bacpac).
2. The ability to directly create SQL Server VMs on Azure with a copied database, or to
copy to an existing database in SQL Database. For more information, see Use the Copy
Database Wizard.

The database back up and restore options are available from SQL Server Management
Studio.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Migrate a Database to SQL Server on an Azure VM
SQL Server on Azure Virtual Machines overview

Related resources
Move data to an Azure SQL Database for Azure Machine Learning
Move data from a SQL Server database to SQL Database with Azure Data Factory
Process data in a SQL Server virtual machine on Azure
What is the Team Data Science Process?
Move data to Azure SQL Database for
Azure Machine Learning
Article • 11/15/2022

This article outlines the options for moving data either from flat files (CSV or TSV
formats) or from data stored in SQL Server to an Azure SQL Database. These tasks for
moving data to the cloud are part of the Team Data Science Process.

For a topic that outlines the options for migrating data from SQL Server into Azure SQL
options, see Migrate to Azure SQL.

The following table summarizes the options for moving data to an Azure SQL Database.

SOURCE DESTINATION: Azure SQL

Flat file (CSV or TSV formatted) Bulk Insert SQL Query

On-premises SQL Server 1. Export to Flat File
2. SQL Server Migration Assistant (SSMA)
3. Database back up and restore
4. Azure Data Factory

Prerequisites
The procedures outlined here require that you have:

An Azure subscription. If you do not have a subscription, you can sign up for a
free trial .
An Azure storage account. You use an Azure storage account for storing the data
in this tutorial. If you don't have an Azure storage account, see the Create a
storage account article. After you have created the storage account, you need to
obtain the account key used to access the storage. See Manage storage account
access keys.
Access to an Azure SQL Database. If you must set up an Azure SQL Database,
Getting Started with Microsoft Azure SQL Database provides information on how
to provision a new instance of an Azure SQL Database.
Installed and configured Azure PowerShell locally. For instructions, see How to
install and configure Azure PowerShell.

Data: The migration processes are demonstrated using the NYC Taxi dataset . The NYC
Taxi dataset contains information on trip data and fares, which is either available
through Azure Open Datasets or from the source TLC Trip Record Data . A sample and
description of these files are provided in NYC Taxi Trips Dataset Description.

You can either adapt the procedures described here to a set of your own data or follow
the steps as described by using the NYC Taxi dataset. To upload the NYC Taxi dataset
into your SQL Server database, follow the procedure outlined in Bulk Import Data into
SQL Server Database.

Moving data from a flat file source to an Azure SQL Database
Data in flat files (CSV or TSV formatted) can be moved to an Azure SQL Database using
a Bulk Insert SQL Query.

Bulk Insert SQL Query


The steps for the procedure using the Bulk Insert SQL Query are similar to the directions
for moving data from a flat file source to SQL Server on an Azure VM. For details, see
Bulk Insert SQL Query.

Moving Data from SQL Server to an Azure SQL Database
If the source data is stored in SQL Server, there are various possibilities for moving the
data to an Azure SQL Database:

1. Export to Flat File


2. SQL Server Migration Assistant (SSMA)
3. Database back up and restore
4. Azure Data Factory

The steps for the first three are similar to those sections in Move data to SQL Server on
an Azure virtual machine that cover these same procedures. Links to the appropriate
sections in that topic are provided in the following instructions.

Export to Flat File


The steps for exporting to a flat file are similar to the directions covered in Export
to Flat File.
SQL Server Migration Assistant (SSMA)
The steps for using the SQL Server Migration Assistant (SSMA) are similar to those
directions covered in SQL Server Migration Assistant (SSMA).

Database back up and restore


The steps for using database backup and restore are similar to those directions listed in
Database backup and restore.

Azure Data Factory


Learn how to move data to an Azure SQL Database with Azure Data Factory (ADF) in this
topic, Move data from a SQL Server to SQL Azure with Azure Data Factory. This topic
shows how to use ADF to move data from a SQL Server database to an Azure SQL
Database via Azure Blob Storage.

Consider using ADF when data needs to be continually migrated with hybrid on-
premises and cloud sources. ADF also helps when the data needs transformations, or
needs new business logic during migration. ADF allows for the scheduling and
monitoring of jobs using simple JSON scripts that manage the movement of data on a
periodic basis. ADF also has other capabilities such as support for complex operations.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.


Create Hive tables and load data from
Azure Blob Storage
Article • 11/15/2022

This article presents generic Hive queries that create Hive tables and load data from
Azure Blob Storage. Some guidance is also provided on partitioning Hive tables and on
using the Optimized Row Columnar (ORC) formatting to improve query performance.

Prerequisites
This article assumes that you have:

Created an Azure Storage account. If you need instructions, see About Azure
Storage accounts.
Provisioned a customized Hadoop cluster with the HDInsight service. If you need
instructions, see Setup Clusters in HDInsight.
Enabled remote access to the cluster, logged in, and opened the Hadoop
Command-Line console. If you need instructions, see Manage Apache Hadoop
clusters.

Upload data to Azure Blob Storage


If you created an Azure virtual machine by following the instructions provided in Set up
an Azure virtual machine for advanced analytics, this script file should have been
downloaded to the C:\Users\<user name>\Documents\Data Science Scripts directory on
the virtual machine. These Hive queries only require that you provide a data schema and
Azure Blob Storage configuration in the appropriate fields to be ready for submission.

We assume that the data for Hive tables is in an uncompressed tabular format, and that
the data has been uploaded to the default (or to an additional) container of the storage
account used by the Hadoop cluster.

If you want to practice on the NYC Taxi Trip Data, you need to:

download the 24 NYC Taxi Trip Data files (12 Trip files and 12 Fare files) -- either
available through Azure Open Datasets or from the source TLC Trip Record Data ,
unzip all files into .csv files, and then
upload them to the default (or another appropriate) container of the Azure Storage
account; options for such an account appear in the Use Azure Storage with Azure
HDInsight clusters topic. That topic also describes the process to upload the .csv
files to the default container on the storage account.

How to submit Hive queries


Hive queries can be submitted by using:

Submit Hive queries through Hadoop Command Line in headnode of Hadoop


cluster
Submit Hive queries with the Hive Editor
Submit Hive queries with Azure PowerShell Commands

Hive queries are SQL-like. If you are familiar with SQL, you may find the Hive for SQL
Users Cheat Sheet useful.

When submitting a Hive query, you can also control the destination of the output from
Hive queries, whether it be on the screen or to a local file on the head node or to an
Azure blob.

Submit Hive queries through Hadoop Command Line in headnode of Hadoop cluster
If the Hive query is complex, submitting it directly in the head node of the Hadoop
cluster typically leads to a faster turnaround than submitting it with a Hive Editor or
Azure PowerShell scripts.

Log in to the head node of the Hadoop cluster, open the Hadoop Command Line on the
desktop of the head node, and enter command cd %hive_home%\bin .

You have three ways to submit Hive queries in the Hadoop Command Line:

directly
using .hql files
with the Hive command console

Submit Hive queries directly in Hadoop Command Line.


You can run a command like hive -e "<your hive query>;" to submit simple Hive queries
directly in the Hadoop Command Line. The command that submits the Hive query and the
output from the Hive query are both displayed in the console.
Submit Hive queries in .hql files
When the Hive query is more complicated and has multiple lines, editing queries in
command line or Hive command console is not practical. An alternative is to use a text
editor in the head node of the Hadoop cluster to save the Hive queries in an .hql file in
a local directory of the head node. Then the Hive query in the .hql file can be
submitted by using the -f argument as follows:

Console

hive -f "<path to the .hql file>"

Suppress progress status screen print of Hive queries

By default, after Hive query is submitted in Hadoop Command Line, the progress of the
Map/Reduce job is printed out on screen. To suppress the screen print of the
Map/Reduce job progress, you can use an argument -S ("S" in upper case) in the
command line as follows:

Console

hive -S -f "<path to the .hql file>"


hive -S -e "<Hive queries>"

Submit Hive queries in Hive command console.

You can also first enter the Hive command console by running the command hive in the
Hadoop Command Line, and then submit Hive queries at the Hive command prompt. The
output of each Hive query is displayed directly in the console.

The previous examples directly output the Hive query results on screen. You can also
write the output to a local file on the head node, or to an Azure blob. Then, you can use
other tools to further analyze the output of Hive queries.

Output Hive query results to a local file. To output Hive query results to a local
directory on the head node, you have to submit the Hive query in the Hadoop
Command Line as follows:

Console

hive -e "<hive query>" > <local path in the head node>


For example, the output of the Hive query can be written to a file
hivequeryoutput.txt in the directory C:\apps\temp.

Output Hive query results to an Azure blob

You can also output the Hive query results to an Azure blob, within the default container
of the Hadoop cluster. The Hive query for this is as follows:

Console

insert overwrite directory 'wasb:///<directory within the default container>'
<select clause from ...>

For example, the output of the Hive query can be written to a blob directory
queryoutputdir within the default container of the Hadoop cluster. Here, you only need
to provide the directory name, without the blob name. An error is thrown if you provide
both directory and blob names, such as wasb:///queryoutputdir/queryoutput.txt.
If you open the default container of the Hadoop cluster using Azure Storage Explorer,
you can see the output of the Hive query. You can apply a filter to retrieve only the
blobs with specified letters in their names.
Submit Hive queries with the Hive Editor
You can also use the Query Console (Hive Editor) by entering a URL of the form
https://<Hadoop cluster name>.azurehdinsight.net/Home/HiveEditor into a web browser.
You must be logged in to see this console, so you need your Hadoop cluster
credentials here.

Submit Hive queries with Azure PowerShell Commands


You can also use PowerShell to submit Hive queries. For instructions, see Submit Hive
jobs using PowerShell.

Create Hive database and tables


The Hive queries are shared in the GitHub repository and can be downloaded from
there.

Here is the Hive query that creates a Hive table.

HiveQL
create database if not exists <database name>;
CREATE EXTERNAL TABLE if not exists <database name>.<table name>
(
field1 string,
field2 int,
field3 float,
field4 double,
...,
fieldN string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>' lines
terminated by '<line separator>'
STORED AS TEXTFILE LOCATION '<storage location>'
TBLPROPERTIES("skip.header.line.count"="1");

Here are the descriptions of the fields that you need to plug in and other configurations:

<database name>: the name of the database that you want to create. If you just
want to use the default database, the query "create database..." can be omitted.
<table name>: the name of the table that you want to create within the specified
database. If you want to use the default database, the table can be directly referred
by <table name> without <database name>.
<field separator>: the separator that delimits fields in the data file to be uploaded
to the Hive table.
<line separator>: the separator that delimits lines in the data file.
<storage location>: the Azure Storage location to save the data of Hive tables. If
you do not specify LOCATION <storage location>, the database and the tables are
stored in hive/warehouse/ directory in the default container of the Hive cluster by
default. If you want to specify the storage location, the storage location has to be
within the default container for the database and tables. This location has to be
referred as location relative to the default container of the cluster in the format of
'wasb:///<directory 1>/' or 'wasb:///<directory 1>/<directory 2>/', etc. After the
query is executed, the relative directories are created within the default container.
TBLPROPERTIES("skip.header.line.count"="1"): If the data file has a header line,
you have to add this property at the end of the create table query. Otherwise, the
header line is loaded as a record to the table. If the data file does not have a
header line, this configuration can be omitted in the query.

Load data to Hive tables


Here is the Hive query that loads data into a Hive table.

HiveQL
LOAD DATA INPATH '<path to blob data>' INTO TABLE <database name>.<table
name>;

<path to blob data>: If the blob file to be uploaded to the Hive table is in the
default container of the HDInsight Hadoop cluster, the <path to blob data> should
be in the format 'wasb://<directory in this container>/<blob file name>'. The blob
file can also be in an additional container of the HDInsight Hadoop cluster. In this
case, <path to blob data> should be in the format 'wasb://<container
name>@<storage account name>.blob.core.windows.net/<blob file name>'.

Note

The blob data to be uploaded to the Hive table has to be in the default or an
additional container of the storage account for the Hadoop cluster. Otherwise, the
LOAD DATA query fails, complaining that it cannot access the data.

Advanced topics: partitioned table and store Hive data in ORC format
If the data is large, partitioning the table is beneficial for queries that only need to scan
a few partitions of the table. For instance, it is reasonable to partition the log data of a
web site by dates.

In addition to partitioning Hive tables, it is also beneficial to store the Hive data in the
Optimized Row Columnar (ORC) format. For more information on ORC formatting, see
Using ORC files improves performance when Hive is reading, writing, and processing
data .

Partitioned table
Here is the Hive query that creates a partitioned table and loads data into it.

HiveQL

CREATE EXTERNAL TABLE IF NOT EXISTS <database name>.<table name>


(field1 string,
...
fieldN string
)
PARTITIONED BY (<partitionfieldname> vartype) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '<field separator>'
lines terminated by '<line separator>'
TBLPROPERTIES("skip.header.line.count"="1");
LOAD DATA INPATH '<path to the source file>' INTO TABLE <database name>.
<partitioned table name>
PARTITION (<partitionfieldname>=<partitionfieldvalue>);

When querying partitioned tables, it is recommended to add the partition condition at
the beginning of the WHERE clause, which improves the search efficiency.

HiveQL

select
field1, field2, ..., fieldN
from <database name>.<partitioned table name>
where <partitionfieldname>=<partitionfieldvalue> and ...;

Store Hive data in ORC format


You cannot directly load data from blob storage into Hive tables that are stored in the
ORC format. Here are the steps that you need to take to load data from Azure blobs
into Hive tables stored in ORC format.

Create an external table STORED AS TEXTFILE and load data from blob storage to the
table.

HiveQL

CREATE EXTERNAL TABLE IF NOT EXISTS <database name>.<external textfile table


name>
(
field1 string,
field2 int,
...
fieldN date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>'
lines terminated by '<line separator>' STORED AS TEXTFILE
LOCATION 'wasb:///<directory in Azure blob>'
TBLPROPERTIES("skip.header.line.count"="1");

LOAD DATA INPATH '<path to the source file>' INTO TABLE <database name>.
<table name>;

Create an internal table with the same schema as the external table in step 1, with the
same field delimiter, and store the Hive data in the ORC format.

HiveQL
CREATE TABLE IF NOT EXISTS <database name>.<ORC table name>
(
field1 string,
field2 int,
...
fieldN date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>' STORED AS ORC;

Select data from the external table in step 1 and insert into the ORC table

HiveQL

INSERT OVERWRITE TABLE <database name>.<ORC table name>


SELECT * FROM <database name>.<external textfile table name>;

Note

If the TEXTFILE table <database name>.<external textfile table name> has
partitions, in STEP 3, the SELECT * FROM <database name>.<external textfile table
name> command selects the partition variable as a field in the returned data set.
Inserting it into the <database name>.<ORC table name> fails since <database
name>.<ORC table name> does not have the partition variable as a field in the
table schema. In this case, you need to specifically select the fields to be inserted to
<database name>.<ORC table name> as follows:

HiveQL

INSERT OVERWRITE TABLE <database name>.<ORC table name> PARTITION


(<partition variable>=<partition value>)
SELECT field1, field2, ..., fieldN
FROM <database name>.<external textfile table name>
WHERE <partition variable>=<partition value>;

After all the data has been inserted into <database name>.<ORC table name>, it is safe
to drop the <external textfile table name> table by using the following query:

HiveQL

DROP TABLE IF EXISTS <database name>.<external textfile table name>;

After following this procedure, you should have a table with data in the ORC format
ready to use.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.


Build and optimize tables for fast
parallel import of data into SQL Server
on an Azure VM
Article • 11/15/2022

This article describes how to build partitioned tables for fast parallel bulk importing of
data to a SQL Server database. For big data loading/transfer to a SQL database,
importing data to the SQL database and subsequent queries can be improved by using
Partitioned Tables and Views.

Create a new database and a set of filegroups


Create a new database, if it doesn't exist already.

Add database filegroups to the database; these filegroups hold the partitioned physical files.

This can be done with CREATE DATABASE if new or ALTER DATABASE if the
database exists already.

Add one or more files (as needed) to each database filegroup.

Note

Specify the target filegroup, which holds data for this partition and the
physical database file name(s) where the filegroup data is stored.

The following example creates a new database with three filegroups other than the
primary and log groups, containing one physical file in each. The database files are
created in the default SQL Server Data folder, as configured in the SQL Server instance.
For more information about the default file locations, see File Locations for Default and
Named Instances of SQL Server.

SQL

DECLARE @data_path nvarchar(256);


SET @data_path = (SELECT SUBSTRING(physical_name, 1,
CHARINDEX(N'master.mdf', LOWER(physical_name)) - 1)
FROM master.sys.master_files
WHERE database_id = 1 AND file_id = 1);

EXECUTE ('
CREATE DATABASE <database_name>
ON PRIMARY
( NAME = ''Primary'', FILENAME = ''' + @data_path +
'<primary_file_name>.mdf'', SIZE = 4096KB , FILEGROWTH = 1024KB ),
FILEGROUP [filegroup_1]
( NAME = ''FileGroup1'', FILENAME = ''' + @data_path +
'<file_name_1>.ndf'' , SIZE = 4096KB , FILEGROWTH = 1024KB ),
FILEGROUP [filegroup_2]
( NAME = ''FileGroup2'', FILENAME = ''' + @data_path +
'<file_name_2>.ndf'' , SIZE = 4096KB , FILEGROWTH = 1024KB ),
FILEGROUP [filegroup_3]
( NAME = ''FileGroup3'', FILENAME = ''' + @data_path +
'<file_name_3>.ndf'' , SIZE = 102400KB , FILEGROWTH = 10240KB )
LOG ON
( NAME = ''LogFileGroup'', FILENAME = ''' + @data_path +
'<log_file_name>.ldf'' , SIZE = 1024KB , FILEGROWTH = 10%)
')

Create a partitioned table


To create partitioned table(s) according to the data schema, mapped to the database
filegroups created in the previous step, you must first create a partition function and
scheme. When data is bulk imported to the partitioned table(s), records are distributed
among the filegroups according to a partition scheme, as described below.

1. Create a partition function


Create a partition function. This function defines the range of values/boundaries to be
included in each individual partition table, for example, to limit partitions by
month(some_datetime_field) in the year 2013:

SQL

CREATE PARTITION FUNCTION <DatetimeFieldPFN>(<datetime_field>)


AS RANGE RIGHT FOR VALUES (
'20130201', '20130301', '20130401',
'20130501', '20130601', '20130701', '20130801',
'20130901', '20131001', '20131101', '20131201' )

2. Create a partition scheme


Create a partition scheme. This scheme maps each partition range in the partition
function to a physical filegroup, for example:

SQL
CREATE PARTITION SCHEME <DatetimeFieldPScheme> AS
PARTITION <DatetimeFieldPFN> TO (
<filegroup_1>, <filegroup_2>, <filegroup_3>, <filegroup_4>,
<filegroup_5>, <filegroup_6>, <filegroup_7>, <filegroup_8>,
<filegroup_9>, <filegroup_10>, <filegroup_11>, <filegroup_12> )

To verify the ranges in effect in each partition according to the function/scheme, run the
following query:

SQL

SELECT psch.name AS PartitionScheme,
    prng.value AS PartitionValue,
    prng.boundary_id AS BoundaryID
FROM sys.partition_functions AS pfun
INNER JOIN sys.partition_schemes psch ON pfun.function_id = psch.function_id
INNER JOIN sys.partition_range_values prng ON prng.function_id = pfun.function_id
WHERE pfun.name = '<DatetimeFieldPFN>'

3. Create a partition table


Create partitioned table(s) according to your data schema, and specify the partition
scheme and constraint field used to partition the table, for example:

SQL

CREATE TABLE <table_name> ( [include schema definition here] )
ON <TablePScheme>(<partition_field>)

For more information, see Create Partitioned Tables and Indexes.

Bulk import the data for each individual partition table
You may use BCP, BULK INSERT, or other methods such as Microsoft Data
Migration . The example provided uses the BCP method.

Alter the database to change the transaction logging scheme to BULK_LOGGED to
minimize the overhead of logging, for example:

SQL
ALTER DATABASE <database_name> SET RECOVERY BULK_LOGGED

To expedite data loading, launch the bulk import operations in parallel. For tips on
expediting bulk importing of big data into SQL Server databases, see Load 1 TB in
less than 1 hour.

The following PowerShell script is an example of parallel data loading using BCP.

PowerShell

# Set database name, input data directory, and output log directory
# This example loads comma-separated input data files
# The example assumes the partitioned data files are named as
<base_file_name>_<partition_number>.csv
# Assumes the input data files include a header line. Loading starts at line
number 2.

$dbname = "<database_name>"
$indir = "<path_to_data_files>"
$logdir = "<path_to_log_directory>"

# Select authentication mode


$sqlauth = 0

# For SQL authentication, set the server and user credentials


$sqlusr = "<user@server>"
$server = "<tcp:serverdns>"
$pass = "<password>"

# Set number of partitions per table - Should match the number of input data
files per table
$numofparts = <number_of_partitions>

# Set table name to be loaded, basename of input data files, input format
file, and number of partitions
$tbname = "<table_name>"
$basename = "<base_input_data_filename_no_extension>"
$fmtfile = "<full_path_to_format_file>"

# Create log directory if it does not exist


New-Item -ErrorAction Ignore -ItemType directory -Path $logdir

# BCP example using Windows authentication


$ScriptBlock1 = {
param($dbname, $tbname, $basename, $fmtfile, $indir, $logdir, $num)
bcp ($dbname + ".." + $tbname) in ($indir + "\" + $basename + "_" + $num
+ ".csv") -o ($logdir + "\" + $tbname + "_" + $num + ".txt") -h "TABLOCK" -F
2 -C "RAW" -f ($fmtfile) -T -b 2500 -t "," -r \n
}

# BCP example using SQL authentication


$ScriptBlock2 = {
param($dbname, $tbname, $basename, $fmtfile, $indir, $logdir, $num,
$sqlusr, $server, $pass)
bcp ($dbname + ".." + $tbname) in ($indir + "\" + $basename + "_" + $num
+ ".csv") -o ($logdir + "\" + $tbname + "_" + $num + ".txt") -h "TABLOCK" -F
2 -C "RAW" -f ($fmtfile) -U $sqlusr -S $server -P $pass -b 2500 -t "," -r \n
}

# Background processing of all partitions


for ($i=1; $i -le $numofparts; $i++)
{
Write-Output "Submit loading trip and fare partitions # $i"
if ($sqlauth -eq 0) {
# Use Windows authentication
Start-Job -ScriptBlock $ScriptBlock1 -Arg ($dbname, $tbname,
$basename, $fmtfile, $indir, $logdir, $i)
}
else {
# Use SQL authentication
Start-Job -ScriptBlock $ScriptBlock2 -Arg ($dbname, $tbname,
$basename, $fmtfile, $indir, $logdir, $i, $sqlusr, $server, $pass)
}
}

Get-Job

# Optional - Wait till all jobs complete and report date and time
date
While (Get-Job -State "Running") { Start-Sleep 10 }
date

Create indexes to optimize joins and query performance
If you extract data for modeling from multiple tables, create indexes on the join
keys to improve the join performance.

Create indexes (clustered or non-clustered) targeting the same filegroup for each
partition, for example:

SQL

CREATE CLUSTERED INDEX <table_idx> ON <table_name>( [include index columns here] )
ON <TablePScheme>(<partition_field>)

-- or,

CREATE INDEX <table_idx> ON <table_name>( [include index columns here] )
ON <TablePScheme>(<partition_field>)

Note

You may choose to create the indexes before bulk importing the data. Index
creation before bulk importing slows down the data loading.

Advanced Analytics Process and Technology in Action Example
For an end-to-end walkthrough example using the Team Data Science Process with a
public dataset, see Team Data Science Process in Action: using SQL Server.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.


Move data from a SQL Server database
to SQL Database with Azure Data
Factory
Article • 07/05/2023

This article shows how to move data from a SQL Server database to Azure SQL Database
via Azure Blob Storage using Azure Data Factory (ADF). This method is a supported
legacy approach that has the advantage of a replicated staging copy, though we
suggest that you look at our data migration page for the latest options.

For a table that summarizes various options for moving data to an Azure SQL Database,
see Move data to an Azure SQL Database for Azure Machine Learning.

Introduction: What is ADF and when should it be used to migrate data?
Azure Data Factory is a fully managed cloud-based data integration service that
orchestrates and automates the movement and transformation of data. The key concept
in the ADF model is the pipeline. A pipeline is a logical grouping of Activities, each of which
defines the actions to perform on the data contained in Datasets. Linked services are
used to define the information needed for Data Factory to connect to the data
resources.

With ADF, existing data processing services can be composed into data pipelines that
are highly available and managed in the cloud. These data pipelines can be scheduled to
ingest, prepare, transform, analyze, and publish data, and ADF manages and
orchestrates the complex data and processing dependencies. Solutions can be quickly
built and deployed in the cloud, connecting a growing number of on-premises and
cloud data sources.

Consider using ADF:

when data needs to be continually migrated in a hybrid scenario that accesses


both on-premises and cloud resources
when the data needs transformations or business logic added to it when being
migrated.

ADF allows for the scheduling and monitoring of jobs using simple JSON scripts that
manage the movement of data on a periodic basis. ADF also has other capabilities such
as support for complex operations. For more information on ADF, see the
documentation at Azure Data Factory (ADF) .

The Scenario
We set up an ADF pipeline that composes two data migration activities. Together they
move data on a daily basis between a SQL Server database and Azure SQL Database.
The two activities are:

Copy data from a SQL Server database to an Azure Blob Storage account
Copy data from the Azure Blob Storage account to Azure SQL Database.

Note

The steps shown here have been adapted from the more detailed tutorial provided
by the ADF team: Copy data from a SQL Server database to Azure Blob storage.
References to the relevant sections of that topic are provided when appropriate.

Prerequisites
This tutorial assumes you have:

An Azure subscription. If you do not have a subscription, you can sign up for a
free trial .
An Azure storage account. You use an Azure storage account for storing the data
in this tutorial. If you don't have an Azure storage account, see the Create a
storage account article. After you have created the storage account, you need to
obtain the account key used to access the storage. See Manage storage account
access keys.
Access to an Azure SQL Database. If you must set up an Azure SQL Database, the
topic Getting Started with Microsoft Azure SQL Database provides information on
how to provision a new instance of an Azure SQL Database.
Installed and configured Azure PowerShell locally. For instructions, see How to
install and configure Azure PowerShell.

Note

This procedure uses the Azure portal .


Upload the data to your SQL Server instance
We use the NYC Taxi dataset to demonstrate the migration process. The NYC Taxi
dataset is available on Azure Blob Storage: NYC Taxi Data. The data has two files: the
trip_data.csv file, which contains trip details, and the trip_fare.csv file, which contains
details of the fare paid for each trip. A sample and description of these files are
provided in NYC Taxi Trips Dataset Description.

You can either adapt the procedure provided here to a set of your own data or follow
the steps as described by using the NYC Taxi dataset. To upload the NYC Taxi dataset
into your SQL Server database, follow the procedure outlined in Bulk Import Data into
SQL Server database.

Create an Azure Data Factory


The instructions for creating a new Azure Data Factory and a resource group in the
Azure portal are provided in Create an Azure Data Factory. Name the new ADF instance
adfdsp and name the resource group that you create adfdsprg.

Install and configure Azure Data Factory Integration Runtime
The Integration Runtime is a customer-managed data integration infrastructure used by
Azure Data Factory to provide data integration capabilities across different network
environments. This runtime was formerly called "Data Management Gateway".

To set it up, follow the instructions for creating a pipeline.

Create linked services to connect to the data resources
A linked service defines the information needed for Azure Data Factory to connect to a
data resource. We have three resources in this scenario for which linked services are
needed:

1. On-premises SQL Server


2. Azure Blob Storage
3. Azure SQL Database
The step-by-step procedure for creating linked services is provided in Create linked
services.

Define and create tables to specify how to access the datasets
Create tables that specify the structure, location, and availability of the datasets with the
following script-based procedures. JSON files are used to define the tables. For more
information on the structure of these files, see Datasets.

Note

You should execute the Add-AzureAccount cmdlet before executing the New-
AzureDataFactoryTable cmdlet to confirm that the right Azure subscription is
selected for the command execution. For documentation of this cmdlet, see Add-
AzureAccount.

The JSON-based definitions in the tables use the following names:

the table name in the SQL Server is nyctaxi_data


the container name in the Azure Blob Storage account is containername

Three table definitions are needed for this ADF pipeline:

1. SQL on-premises Table


2. Blob Table
3. SQL Azure Table

Note

These procedures use Azure PowerShell to define and create the ADF activities. But
these tasks can also be accomplished using the Azure portal. For details, see Create
datasets.

SQL on-premises Table


The table definition for the SQL Server is specified in the following JSON file:

JSON
{
"name": "OnPremSQLTable",
"properties":
{
"location":
{
"type": "OnPremisesSqlServerTableLocation",
"tableName": "nyctaxi_data",
"linkedServiceName": "adfonpremsql"
},
"availability":
{
"frequency": "Day",
"interval": 1,
"waitOnExternal":
{
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}

The column names were not included here. You can subselect on the column names by
including them here (for details, check the ADF documentation topic).

Copy the JSON definition of the table into a file called onpremtabledef.json file and save
it to a known location (here assumed to be C:\temp\onpremtabledef.json). Create the
table in ADF with the following Azure PowerShell cmdlet:

Azure PowerShell

New-AzureDataFactoryTable -ResourceGroupName ADFdsprg -DataFactoryName


ADFdsp –File C:\temp\onpremtabledef.json

Blob Table
The definition for the table for the output blob location follows (this table maps the
ingested data from on-premises to the Azure blob):

JSON

{
"name": "OutputBlobTable",
"properties":
{
"location":
{
"type": "AzureBlobLocation",
"folderPath": "containername",
"format":
{
"type": "TextFormat",
"columnDelimiter": "\t"
},
"linkedServiceName": "adfds"
},
"availability":
{
"frequency": "Day",
"interval": 1
}
}
}

Copy the JSON definition of the table into a file called bloboutputtabledef.json file and
save it to a known location (here assumed to be C:\temp\bloboutputtabledef.json). Create
the table in ADF with the following Azure PowerShell cmdlet:

Azure PowerShell

New-AzureDataFactoryTable -ResourceGroupName adfdsprg -DataFactoryName


adfdsp -File C:\temp\bloboutputtabledef.json

SQL Azure Table


The definition for the table for the SQL Azure output follows (this schema maps the
data coming from the blob):

JSON

{
"name": "OutputSQLAzureTable",
"properties":
{
"structure":
[
{ "name": "column1", "type": "String"},
{ "name": "column2", "type": "String"}
],
"location":
{
"type": "AzureSqlTableLocation",
"tableName": "your_db_name",
"linkedServiceName": "adfdssqlazure_linked_servicename"
},
"availability":
{
"frequency": "Day",
"interval": 1
}
}
}

Copy the JSON definition of the table into a file called AzureSqlTable.json file and save it
to a known location (here assumed to be C:\temp\AzureSqlTable.json). Create the table
in ADF with the following Azure PowerShell cmdlet:

Azure PowerShell

New-AzureDataFactoryTable -ResourceGroupName adfdsprg -DataFactoryName


adfdsp -File C:\temp\AzureSqlTable.json

Define and create the pipeline


Specify the activities that belong to the pipeline and create the pipeline with the
following script-based procedures. A JSON file is used to define the pipeline properties.

The script assumes that the pipeline name is AMLDSProcessPipeline.

Also note that we set the periodicity of the pipeline to be executed on a daily basis
and use the default execution time for the job (12 am UTC).

Note

The following procedures use Azure PowerShell to define and create the ADF
pipeline. But this task can also be accomplished using the Azure portal. For details,
see Create pipeline.

Using the table definitions provided previously, the pipeline definition for the ADF is
specified as follows:

JSON

{
"name": "AMLDSProcessPipeline",
"properties":
{
"description" : "This pipeline has two activities: the first one
copies data from SQL Server to Azure Blob, and the second one copies from
Azure Blob to Azure Database Table",
"activities":
[
{
"name": "CopyFromSQLtoBlob",
"description": "Copy data from SQL Server to blob",
"type": "CopyActivity",
"inputs": [ {"name": "OnPremSQLTable"} ],
"outputs": [ {"name": "OutputBlobTable"} ],
"transformation":
{
"source":
{
"type": "SqlSource",
"sqlReaderQuery": "select * from nyctaxi_data"
},
"sink":
{
"type": "BlobSink"
}
},
"Policy":
{
"concurrency": 3,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 0,
"timeout": "01:00:00"
}
},
{
"name": "CopyFromBlobtoSQLAzure",
"description": "Push data to Sql Azure",
"type": "CopyActivity",
"inputs": [ {"name": "OutputBlobTable"} ],
"outputs": [ {"name": "OutputSQLAzureTable"} ],
"transformation":
{
"source":
{
"type": "BlobSource"
},
"sink":
{
"type": "SqlSink",
"WriteBatchTimeout": "00:5:00",
}
},
"Policy":
{
"concurrency": 3,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 2,
"timeout": "02:00:00"
}
}
]
}
}

Copy this JSON definition of the pipeline into a file called pipelinedef.json file and save it
to a known location (here assumed to be C:\temp\pipelinedef.json). Create the pipeline in
ADF with the following Azure PowerShell cmdlet:

Azure PowerShell

New-AzureDataFactoryPipeline -ResourceGroupName adfdsprg -DataFactoryName


adfdsp -File C:\temp\pipelinedef.json

Start the Pipeline


The pipeline can now be run using the following command:

Azure PowerShell

Set-AzureDataFactoryPipelineActivePeriod -ResourceGroupName ADFdsprg -DataFactoryName ADFdsp -StartDateTime startdateZ -EndDateTime enddateZ -Name AMLDSProcessPipeline

The startdate and enddate parameter values need to be replaced with the actual dates
between which you want the pipeline to run.

Once the pipeline executes, you should be able to see the data show up in the container
selected for the blob, one file per day.

We have not leveraged the functionality provided by ADF to pipe data incrementally. For
more information on how to do this and other capabilities provided by ADF, see the
ADF documentation .

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.


Prepare data for enhanced machine
learning
Article • 01/06/2023

Pre-processing and cleaning data are important tasks that must be conducted before a
dataset can be used for model training. Raw data is often noisy and unreliable, and may
be missing values. Using such data for modeling can produce misleading results. These
tasks are part of the Team Data Science Process (TDSP) and typically follow an initial
exploration of a dataset used to discover and plan the pre-processing required. For
more detailed instructions on the TDSP process, see the steps outlined in the Team Data
Science Process.

Pre-processing and cleaning tasks, like the data exploration task, can be carried out in a
wide variety of environments, such as SQL or Hive or Azure Machine Learning Studio
(classic), and with various tools and languages, such as R or Python, depending where
your data is stored and how it is formatted. Since TDSP is iterative in nature, these tasks
can take place at various steps in the workflow of the process.

This article introduces various data processing concepts and tasks that can be
undertaken either before or after ingesting data into Azure Machine Learning Studio
(classic).

For an example of data exploration and pre-processing done inside Azure Machine
Learning Studio (classic), see the Pre-processing data video.

Why pre-process and clean data?


Real world data is gathered from various sources and processes and it may contain
irregularities or corrupt data compromising the quality of the dataset. The typical data
quality issues that arise are:

Incomplete: Data lacks attributes or contains missing values.
Noisy: Data contains erroneous records or outliers.
Inconsistent: Data contains conflicting records or discrepancies.

Quality data is a prerequisite for quality predictive models. To avoid "garbage in,
garbage out" and improve data quality and therefore model performance, it is
imperative to conduct a data health screen to spot data issues early and decide on the
corresponding data processing and cleaning steps.
What are some typical data health screens that
are employed?
We can check the general quality of data by checking:

The number of records.


The number of attributes (or features).
The attribute data types (nominal, ordinal, or continuous).
The number of missing values.
Well-formed data.
If the data is in TSV or CSV, check that the column separators and line
separators always correctly separate columns and lines.
If the data is in HTML or XML format, check whether the data is well formed
based on their respective standards.
Parsing may also be necessary in order to extract structured information from
semi-structured or unstructured data.
Inconsistent data records. Check that the values are within the allowed range. For
example, if the data contains student GPA (grade point average), check whether the
GPA is in the designated range, say 0~4.
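
As an illustration, the following is a minimal pandas sketch of such a health screen;
the file name, the separator, and the GPA range check are assumptions for this example.

Python

# A quick health screen of a tabular dataset with pandas.
# The file name, separator, and the "gpa" column are assumptions for this example.
import pandas as pd

df = pd.read_csv("students.csv", sep=",")

print(len(df))                     # number of records
print(df.shape[1])                 # number of attributes (features)
print(df.dtypes)                   # attribute data types
print(df.isna().sum())             # number of missing values per attribute
print(df[(df["gpa"] < 0) | (df["gpa"] > 4)])   # records outside the designated GPA range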

When you find issues with data, processing steps are necessary. These steps often involve
handling missing values, data normalization, discretization, text processing to remove
and/or replace embedded characters that may affect data alignment, resolving mixed data
types in common fields, and others.

Azure Machine Learning consumes well-formed tabular data. If the data is already in
tabular form, data pre-processing can be performed directly with Azure Machine
Learning Studio (classic). If the data is not in tabular form, say it is in XML, parsing
may be required in order to convert the data to tabular form.

What are some of the major tasks in data pre-processing?
Data cleaning: Fill in missing values, detect, and remove noisy data and outliers.
Data transformation: Normalize data to reduce dimensions and noise.
Data reduction: Sample data records or attributes for easier data handling.
Data discretization: Convert continuous attributes to categorical attributes for
ease of use with certain machine learning methods.
Text cleaning: Remove embedded characters that may cause data misalignment, for
example, embedded tabs in a tab-separated data file or embedded new lines that break
records.

The sections below detail some of these data processing steps.

How to deal with missing values?


To deal with missing values, it is best to first identify the reason for the missing values to
better handle the problem. Typical missing value handling methods are:

Deletion: Remove records with missing values


Dummy substitution: Replace missing values with a dummy value: for example, unknown
for categorical or 0 for numerical values.
Mean substitution: If the missing data is numerical, replace the missing values with
the mean.
Frequent substitution: If the missing data is categorical, replace the missing values
with the most frequent item.
Regression substitution: Use a regression method to replace missing values with
regressed values.
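
As a minimal, hedged sketch of these methods in pandas (the DataFrame df and the columns
age and city are hypothetical):

Python

import pandas as pd

df = pd.DataFrame({'age': [25, None, 40, 31],
                   'city': ['NY', None, 'NY', 'MA']})   # hypothetical example data

# Deletion: remove records with missing values
df_deleted = df.dropna()

# Dummy substitution: replace missing categorical values with 'unknown'
df_dummy = df.fillna({'city': 'unknown'})

# Mean substitution for a numerical column
df_mean = df.fillna({'age': df['age'].mean()})

# Frequent (mode) substitution for a categorical column
df_mode = df.fillna({'city': df['city'].mode()[0]})

# Regression substitution would instead train a model on the complete records
# and predict the missing values (not shown here).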

How to normalize data?


Data normalization rescales numerical values to a specified range. Popular data
normalization methods include:

Min-Max Normalization: Linearly transform the data to a range, say between 0
and 1, where the min value is scaled to 0 and the max value to 1.
Z-score Normalization: Scale data based on mean and standard deviation: divide
the difference between the data and the mean by the standard deviation.
Decimal scaling: Scale the data by moving the decimal point of the attribute value.
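
The following minimal sketch illustrates these three methods on a hypothetical numeric
pandas Series:

Python

import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 10.0])   # hypothetical numeric column

# Min-Max normalization to the range [0, 1]
min_max = (s - s.min()) / (s.max() - s.min())

# Z-score normalization
z_score = (s - s.mean()) / s.std()

# Decimal scaling: divide by a power of 10 large enough that all values fall below 1
decimal_scaled = s / 10 ** len(str(int(s.abs().max())))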

How to discretize data?


Data can be discretized by converting continuous values to nominal attributes or
intervals. Some ways of doing this are:

Equal-Width Binning: Divide the range of all possible values of an attribute into N
groups of the same size, and assign the values that fall in a bin with the bin
number.
Equal-Height Binning: Divide the range of all possible values of an attribute into N
groups, each containing the same number of instances, then assign the values that
fall in a bin with the bin number.
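
For example, both binning schemes can be sketched with pandas (the values and the number of
bins are hypothetical):

Python

import pandas as pd

ages = pd.Series([5, 21, 34, 47, 60, 78])   # hypothetical continuous attribute

# Equal-width binning: 3 bins of equal width, values replaced by the bin number
equal_width = pd.cut(ages, bins=3, labels=False)

# Equal-height binning: 3 bins, each containing the same number of instances
equal_height = pd.qcut(ages, q=3, labels=False)
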
How to reduce data?
There are various methods to reduce data size for easier data handling. Depending on
data size and the domain, the following methods can be applied:

Record Sampling: Sample the data records and only choose the representative
subset from the data.
Attribute Sampling: Select only a subset of the most important attributes from the
data.
Aggregation: Divide the data into groups and store the numbers for each group.
For example, the daily revenue numbers of a restaurant chain over the past 20
years can be aggregated to monthly revenue to reduce the size of the data.
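
As a minimal sketch of these methods in pandas (the DataFrame and column names are
hypothetical):

Python

import pandas as pd

df = pd.DataFrame({'store': ['A', 'A', 'B', 'B'],
                   'date': pd.to_datetime(['2013-01-01', '2013-01-02',
                                           '2013-01-01', '2013-01-02']),
                   'revenue': [100.0, 120.0, 90.0, 95.0]})   # hypothetical daily revenue

# Record sampling: keep a random 50 percent subset of the rows
sampled_records = df.sample(frac=0.5, random_state=0)

# Attribute sampling: keep only the most important columns
sampled_attributes = df[['store', 'revenue']]

# Aggregation: roll daily revenue up to monthly revenue per store
monthly = df.groupby(['store', df['date'].dt.to_period('M')])['revenue'].sum()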

How to clean text data?


Text fields in tabular data may include characters that affect column alignment and/or
record boundaries. For example, embedded tabs in a tab-separated file cause column
misalignment, and embedded new line characters break record lines. Improper text
encoding handling while writing or reading text leads to information loss, inadvertent
introduction of unreadable characters (like nulls), and may also affect text parsing.
Careful parsing and editing may be required in order to clean text fields for proper
alignment and/or to extract structured data from unstructured or semi-structured text
data.
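
As a minimal sketch (assuming the raw file can still be parsed as tab-separated data; the
file names are hypothetical), embedded tabs, carriage returns, and new lines can be
stripped from text columns before the data is written back out:

Python

import pandas as pd

df = pd.read_csv('raw.tsv', sep='\t', encoding='utf-8')   # hypothetical input file

# Replace embedded tabs, carriage returns, and new lines in every text column
# so the cleaned file stays aligned when it is written back out as TSV.
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.replace(r'[\t\r\n]', ' ', regex=True)

df.to_csv('clean.tsv', sep='\t', index=False)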

Data exploration offers an early view into the data. A number of data issues can be
uncovered during this step, and corresponding methods can be applied to address them. It
is important to ask questions such as what the source of an issue is and how the issue may
have been introduced. This process also helps you decide on the data processing steps
needed to resolve those issues. Identifying the final use cases and personas also helps
you prioritize the data processing effort.

References
Data Mining: Concepts and Techniques, Third Edition, Morgan Kaufmann, 2011,
Jiawei Han, Micheline Kamber, and Jian Pei

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:

Mark Tabladillo | Senior Cloud Solution Architect


Next steps
Preprocess large datasets with Azure Machine Learning
Azure Machine Learning Studio
What is Azure Machine Learning?

Related resources
Explore data in the Team Data Science Process
Sample data in Azure blob containers, SQL Server, and Hive tables
Process Azure Blob Storage data with advanced analytics
What is the Team Data Science Process?
Explore data in the Team Data Science
Process
Article • 11/15/2022

Exploring data is a step in the Team Data Science Process.

The following articles describe how to explore data in three different storage
environments that are typically used in the Data Science Process:

Explore Azure blob container data using the Pandas Python package.
Explore SQL Server data by using SQL and by using a programming language like
Python.
Explore Hive table data using Hive queries.

The Azure Machine Learning Resources provide documentation and videos on getting
started with Azure Machine Learning.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect



Explore data in Azure Blob storage with
the pandas Python package
Article • 05/30/2023

This article covers how to explore data that is stored in an Azure blob container by using
the pandas Python package.

This task is a step in the Team Data Science Process.

Prerequisites
This article assumes that you have:

Created an Azure storage account. If you need instructions, see Create an Azure
Storage account
Stored your data in an Azure Blob storage account. If you need instructions, see
Moving data to and from Azure Storage

Load the data into a pandas DataFrame


To explore and manipulate a dataset, it must first be downloaded from the blob source
to a local file, which can then be loaded in a pandas DataFrame. Here are the steps to
follow for this procedure:

1. Download the data from Azure blob with the following Python code sample using the
Blob service. Replace the variables in the following code with your specific values:

Python

from azure.storage.blob import BlobServiceClient
import pandas as pd
import time

STORAGEACCOUNTURL= <storage_account_url>
STORAGEACCOUNTKEY= <storage_account_key>
LOCALFILENAME= <local_file_name>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

#download from blob
t1=time.time()
blob_service_client_instance = BlobServiceClient(account_url=STORAGEACCOUNTURL, credential=STORAGEACCOUNTKEY)
blob_client_instance = blob_service_client_instance.get_blob_client(CONTAINERNAME, BLOBNAME, snapshot=None)
with open(LOCALFILENAME, "wb") as my_blob:
    blob_data = blob_client_instance.download_blob()
    blob_data.readinto(my_blob)
t2=time.time()
print(("It takes %s seconds to download "+BLOBNAME) % (t2 - t1))

2. Read the data into a pandas DataFrame from the downloaded file.

Python

# LOCALFILENAME is the file path
dataframe_blobdata = pd.read_csv(LOCALFILENAME)

If you need more general information on reading from an Azure Storage Blob, look at
our documentation Azure Storage Blobs client library for Python.

Now you are ready to explore the data and generate features on this dataset.

Examples of data exploration using pandas


Here are a few examples of ways to explore data using pandas:

1. Inspect the number of rows and columns

Python

print('the size of the data is: %d rows and %d columns' % dataframe_blobdata.shape)

2. Inspect the first or last few rows in the following dataset:

Python

dataframe_blobdata.head(10)

dataframe_blobdata.tail(10)

3. Check the data type each column was imported as using the following sample
code

Python

for col in dataframe_blobdata.columns:
    print(dataframe_blobdata[col].name, ':\t', dataframe_blobdata[col].dtype)

4. Check the basic stats for the columns in the data set as follows

Python

dataframe_blobdata.describe()

5. Look at the number of entries for each column value as follows

Python

dataframe_blobdata['<column_name>'].value_counts()

6. Count missing values versus the actual number of entries in each column using
the following sample code

Python

miss_num = dataframe_blobdata.shape[0] - dataframe_blobdata.count()


print(miss_num)

7. If you have missing values for a specific column in the data, you can drop them as
follows:

Python

dataframe_blobdata_noNA = dataframe_blobdata.dropna()
dataframe_blobdata_noNA.shape

Another way to replace missing values is with the mode function:

Python

dataframe_blobdata_mode = dataframe_blobdata.fillna(
{'<column_name>': dataframe_blobdata['<column_name>'].mode()[0]})

8. Create a histogram plot using variable number of bins to plot the distribution of a
variable

Python

import numpy as np

dataframe_blobdata['<column_name>'].value_counts().plot(kind='bar')

np.log(dataframe_blobdata['<column_name>']+1).hist(bins=50)

9. Look at correlations between variables using a scatterplot or using the built-in


correlation function

Python

import matplotlib.pyplot as plt

# relationship between column_a and column_b using scatter plot
plt.scatter(dataframe_blobdata['<column_a>'], dataframe_blobdata['<column_b>'])

# correlation between column_a and column_b
dataframe_blobdata[['<column_a>', '<column_b>']].corr()

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect



Explore data in a SQL Server virtual
machine on Azure
Article • 05/30/2023

This article covers how to explore data that is stored in a SQL Server VM on Azure. Use
SQL or Python to examine the data.

This task is a step in the Team Data Science Process.

Note

The sample SQL statements in this document assume that data is in SQL Server. If it
isn't, refer to the cloud data science process map to learn how to move your data
to SQL Server.

Explore SQL data with SQL scripts


Here are a few sample SQL scripts that can be used to explore data stored in SQL Server.

1. Get the count of observations per day

SELECT CONVERT(date, <date_columnname>) as date, count(*) as c from <tablename> group by CONVERT(date, <date_columnname>)

2. Get the levels in a categorical column

select distinct <column_name> from <tablename>

3. Get the number of levels in combination of two categorical columns

select <column_a>, <column_b>, count(*) from <tablename> group by <column_a>, <column_b>

4. Get the distribution for numerical columns

select <column_name>, count(*) from <tablename> group by <column_name>

Note

For a practical example, you can use the NYC Taxi dataset and refer to the IPNB
titled NYC Data wrangling using IPython Notebook and SQL Server for an end-
to-end walk-through.

Explore SQL data with Python


Using Python to explore data and generate features when the data is in SQL Server is
similar to processing data in Azure blob using Python, as documented in Process Azure
Blob data in your data science environment. Load the data from the database into a
pandas DataFrame, which can then be processed further. This section documents the process
of connecting to the database and loading the data into a DataFrame.

The following connection string format can be used to connect to a SQL Server
database from Python using pyodbc (replace servername, dbname, username, and
password with your specific values):

Python

#Set up the SQL Azure connection
import pyodbc
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=<servername>;DATABASE=<dbname>;UID=<username>;PWD=<password>')

The Pandas library in Python provides a rich set of data structures and data analysis
tools for data manipulation for Python programming. The following code reads the
results returned from a SQL Server database into a Pandas data frame:

Python

import pandas as pd

# Query database and load the returned results in pandas data frame
data_frame = pd.read_sql('''select <columnname1>, <columnname2>... from <tablename>''', conn)

Now you can work with the Pandas DataFrame as covered in the topic Process Azure
Blob data in your data science environment.

The Team Data Science Process in action example
For an end-to-end walkthrough example of the Cortana Analytics Process using a public
dataset, see The Team Data Science Process in action: using SQL Server.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect



Explore data in Hive tables with Hive
queries
Article • 11/15/2022

This article provides sample Hive scripts that are used to explore data in Hive tables in
an HDInsight Hadoop cluster.

This task is a step in the Team Data Science Process.

Prerequisites
This article assumes that you have:

Created an Azure storage account. If you need instructions, see Create an Azure
Storage account
Provisioned a customized Hadoop cluster with the HDInsight service. If you need
instructions, see Customize Azure HDInsight Hadoop Clusters for Advanced
Analytics.
The data has been uploaded to Hive tables in Azure HDInsight Hadoop clusters. If
it has not, follow the instructions in Create and load data to Hive tables to upload
data to Hive tables first.
Enabled remote access to the cluster. If you need instructions, see Access the Head
Node of Hadoop Cluster.
If you need instructions on how to submit Hive queries, see How to Submit Hive
Queries

Example Hive query scripts for data exploration


1. Get the count of observations per partition

SELECT <partitionfieldname>, count(*) from <databasename>.<tablename> group by <partitionfieldname>;

2. Get the count of observations per day

SELECT to_date(<date_columnname>), count(*) from <databasename>.<tablename> group by to_date(<date_columnname>);

3. Get the levels in a categorical column

SELECT distinct <column_name> from <databasename>.<tablename>

4. Get the number of levels in combination of two categorical columns

SELECT <column_a>, <column_b>, count(*) from <databasename>.<tablename> group by <column_a>, <column_b>

5. Get the distribution for numerical columns

SELECT <column_name>, count(*) from <databasename>.<tablename> group by <column_name>

6. Extract records from joining two tables

HiveQL

SELECT
    a.<common_columnname1> as <new_name1>,
    a.<common_columnname2> as <new_name2>,
    a.<a_column_name1> as <new_name3>,
    a.<a_column_name2> as <new_name4>,
    b.<b_column_name1> as <new_name5>,
    b.<b_column_name2> as <new_name6>
FROM
    (
    SELECT <common_columnname1>,
        <common_columnname2>,
        <a_column_name1>,
        <a_column_name2>
    FROM <databasename>.<tablename1>
    ) a
join
    (
    SELECT <common_columnname1>,
        <common_columnname2>,
        <b_column_name1>,
        <b_column_name2>
    FROM <databasename>.<tablename2>
    ) b
ON a.<common_columnname1>=b.<common_columnname1> and a.<common_columnname2>=b.<common_columnname2>

Additional query scripts for taxi trip data scenarios
Examples of queries that are specific to NYC Taxi Trip Data scenarios are also provided in
the GitHub repository . These queries already have the data schema specified and are ready
to be submitted to run. The NYC Taxi Trip data is available through Azure Open Datasets or
from the source TLC Trip Record Data .

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect



Sample data in Azure blob containers,
SQL Server, and Hive tables
Article • 11/15/2022

The following articles describe how to sample data that is stored in one of three
different Azure locations:

Azure blob container data is sampled by downloading it programmatically and
then sampling it with sample Python code.
SQL Server data is sampled using both SQL and the Python Programming
Language.
Hive table data is sampled using Hive queries.

This sampling task is a step in the Team Data Science Process (TDSP).

Why sample data?

If the dataset you plan to analyze is large, it's usually a good idea to down-sample the
data to reduce it to a smaller but representative and more manageable size. Downsizing
facilitates data understanding, exploration, and feature engineering. The role of sampling
in the Cortana Analytics Process is to enable fast prototyping of the data processing
functions and machine learning models.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect



Sample data in Azure Blob storage
Article • 05/30/2023

This article covers sampling data stored in Azure Blob storage by downloading it
programmatically and then sampling it using procedures written in Python.

Why sample your data? If the dataset you plan to analyze is large, it's usually a good
idea to down-sample the data to reduce it to a smaller but representative and more
manageable size. Sampling facilitates data understanding, exploration, and feature
engineering. Its role in the Cortana Analytics Process is to enable fast prototyping of the
data processing functions and machine learning models.

This sampling task is a step in the Team Data Science Process (TDSP).

Download and down-sample data


1. Download the data from Azure Blob storage using the Blob service from the
following sample Python code:

Python

from azure.storage.blob import BlobService
import tables
import time

STORAGEACCOUNTNAME= <storage_account_name>
STORAGEACCOUNTKEY= <storage_account_key>
LOCALFILENAME= <local_file_name>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

#download from blob
t1=time.time()
blob_service=BlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTKEY)
blob_service.get_blob_to_path(CONTAINERNAME,BLOBNAME,LOCALFILENAME)
t2=time.time()
print(("It takes %s seconds to download "+BLOBNAME) % (t2 - t1))

2. Read data into a Pandas data-frame from the file downloaded above.

Python

import pandas as pd

#directly read from file on disk
dataframe_blobdata = pd.read_csv(LOCALFILENAME)
3. Down-sample the data using numpy's random.choice as follows:

Python

import numpy as np

# A 1 percent sample
sample_ratio = 0.01
sample_size = int(np.round(dataframe_blobdata.shape[0] * sample_ratio))
sample_rows = np.random.choice(dataframe_blobdata.index.values, sample_size)
dataframe_blobdata_sample = dataframe_blobdata.loc[sample_rows]

Now you can work with the above data frame, containing the 1 percent sample, for further
exploration and feature generation.

Upload data and read it into Azure Machine Learning
You can use the following sample code to down-sample the data and use it directly in
Azure Machine Learning:

1. Write the data frame to a local file

Python

dataframe.to_csv(os.path.join(os.getcwd(),LOCALFILENAME), sep='\t',
encoding='utf-8', index=False)

2. Upload the local file to an Azure blob using the following sample code:

Python

from azure.storage.blob import BlobService
import tables
import os

STORAGEACCOUNTNAME= <storage_account_name>
LOCALFILENAME= <local_file_name>
STORAGEACCOUNTKEY= <storage_account_key>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

output_blob_service=BlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTKEY)
localfileprocessed = os.path.join(os.getcwd(),LOCALFILENAME) #assuming file is in current working directory
try:
    #perform upload
    output_blob_service.put_block_blob_from_path(CONTAINERNAME,BLOBNAME,localfileprocessed)
except:
    print ("Something went wrong with uploading to the blob:"+ BLOBNAME)

3. Make a datastore in Azure Machine Learning which points to the Azure Blob
Storage. This link describes the concept of datastores and how to subsequently
make a dataset for use with Azure Machine Learning.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect



Sample data in SQL Server on Azure
Article • 05/30/2023

This article shows how to sample data stored in SQL Server on Azure using either SQL or
the Python programming language. It also shows how to move sampled data into Azure
Machine Learning by saving it to a file, uploading it to an Azure blob, and then reading
it into Azure Machine Learning Studio.

The Python sampling uses the pyodbc ODBC library to connect to SQL Server on
Azure and the Pandas library to do the sampling.

Note

The sample SQL code in this document assumes that the data is in a SQL Server on
Azure. If it is not, refer to Move data to SQL Server on Azure article for instructions
on how to move your data to SQL Server on Azure.

Why sample your data? If the dataset you plan to analyze is large, it's usually a good
idea to down-sample the data to reduce it to a smaller but representative and more
manageable size. Sampling facilitates data understanding, exploration, and feature
engineering. Its role in the Team Data Science Process (TDSP) is to enable fast
prototyping of the data processing functions and machine learning models.

This sampling task is a step in the Team Data Science Process (TDSP).

Using SQL
This section describes several methods using SQL to perform simple random sampling
against the data in the database. Choose a method based on your data size and its
distribution.

The following two items show how to use newid in SQL Server to perform the sampling.
The method you choose depends on how random you want the sample to be (pk_id in
the following sample code is assumed to be an autogenerated primary key).

1. Less strict random sample

SQL

select * from <table_name> where <primary_key> in
(select top 10 percent <primary_key> from <table_name> order by newid())
2. More random sample

SQL

SELECT * FROM <table_name>
WHERE 0.1 >= CAST(CHECKSUM(NEWID(), <primary_key>) & 0x7fffffff AS float)/ CAST (0x7fffffff AS int)

Tablesample can be used for sampling the data as well. This option may be a better
approach if your data size is large (assuming that data on different pages is not
correlated) and you need the query to complete in a reasonable time.

SQL

SELECT *
FROM <table_name>
TABLESAMPLE (10 PERCENT)

Note

You can explore and generate features from this sampled data by storing it in a
new table

Connecting to Azure Machine Learning


You may directly use the sample queries above in Azure Machine Learning code
(perhaps a notebook, or code inserted into Designer). See this link for more details about
how to connect to storage with an Azure Machine Learning datastore.

Using the Python programming language


This section demonstrates using the pyodbc library to establish an ODBC connection to a
SQL Server database in Python. The database connection string is as follows (replace
servername, dbname, username, and password with your configuration):

Python

#Set up the SQL Azure connection
import pyodbc
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=<servername>;DATABASE=<dbname>;UID=<username>;PWD=<password>')
The Pandas library in Python provides a rich set of data structures and data analysis
tools for data manipulation for Python programming. The following code reads a 0.1%
sample of the data from a table in Azure SQL Database into a Pandas data frame:

Python

import pandas as pd

# Query database and load the returned results in pandas data frame
data_frame = pd.read_sql('''select column1, column2... from <table_name>
tablesample (0.1 percent)''', conn)

You can now work with the sampled data in the Pandas data frame.

Connecting to Azure Machine Learning


You can use the following sample code to save the down-sampled data to a file and
upload it to an Azure blob. The data in the blob can be directly read into an Azure
Machine Learning Experiment using the Import Data module. The steps are as follows:

1. Write the pandas data frame to a local file

Python

dataframe.to_csv(os.path.join(os.getcwd(),LOCALFILENAME), sep='\t',
encoding='utf-8', index=False)

2. Upload local file to Azure blob

Python

from azure.storage.blob import BlobService
import tables
import os

STORAGEACCOUNTNAME= <storage_account_name>
LOCALFILENAME= <local_file_name>
STORAGEACCOUNTKEY= <storage_account_key>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

output_blob_service=BlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTKEY)
localfileprocessed = os.path.join(os.getcwd(),LOCALFILENAME) #assuming file is in current working directory

try:
    #perform upload
    output_blob_service.put_block_blob_from_path(CONTAINERNAME,BLOBNAME,localfileprocessed)
except:
    print ("Something went wrong with uploading blob:"+BLOBNAME)

3. This guide provides an overview of the next step to access data in Azure Machine
Learning through datastores and datasets.

The Team Data Science Process in Action example
To walk through an example of the Team Data Science Process using a public dataset,
see Team Data Science Process in Action: using SQL Server.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect



Sample data in Azure HDInsight Hive
tables
Article • 11/15/2022

This article describes how to down-sample data stored in Azure HDInsight Hive tables
using Hive queries to reduce it to a size more manageable for analysis. It covers three
popularly used sampling methods:

Uniform random sampling


Random sampling by groups
Stratified sampling

Why sample your data? If the dataset you plan to analyze is large, it's usually a good
idea to down-sample the data to reduce it to a smaller but representative and more
manageable size. Down-sampling facilitates data understanding, exploration, and
feature engineering. Its role in the Team Data Science Process is to enable fast
prototyping of the data processing functions and machine learning models.

This sampling task is a step in the Team Data Science Process (TDSP).

How to submit Hive queries


Hive queries can be submitted from the Hadoop Command-Line console on the head
node of the Hadoop cluster. Log into the head node of the Hadoop cluster, open the
Hadoop Command-Line console, and submit the Hive queries from there. For
instructions on submitting Hive queries in the Hadoop Command-Line console, see How
to Submit Hive Queries.

Uniform random sampling


Uniform random sampling means that each row in the data set has an equal chance of
being sampled. It can be implemented by adding an extra rand() field to the data set in
the inner "select" query, and then filtering on that random field in the outer "select"
query.

Here is an example query:

HiveQL

SET sampleRate=<sample rate, 0-1>;


select
field1, field2, …, fieldN
from
(
select
field1, field2, …, fieldN, rand() as samplekey
from <hive table name>
)a
where samplekey<='${hiveconf:sampleRate}'

Here, <sample rate, 0-1> specifies the proportion of records that the users want to
sample.

Random sampling by groups


When sampling categorical data, you may want to either include or exclude all of the
instances for some value of the categorical variable. This sort of sampling is called
"sampling by group". For example, if you have a categorical variable "State", which has
values such as NY, MA, CA, NJ, and PA, you want records from each state to be together,
whether they are sampled or not.

Here is an example query that samples by group:

HiveQL

SET sampleRate=<sample rate, 0-1>;


select
b.field1, b.field2, …, b.catfield, …, b.fieldN
from
(
select
field1, field2, …, catfield, …, fieldN
from <table name>
)b
join
(
select
catfield
from
(
select
catfield, rand() as samplekey
from <table name>
group by catfield
)a
where samplekey<='${hiveconf:sampleRate}'
)c
on b.catfield=c.catfield
Stratified sampling
Random sampling is stratified with respect to a categorical variable when the samples
obtained have categorical values that are present in the same ratio as they were in the
parent population. Using the same example as above, suppose your data has the
following observations by states: NJ has 100 observations, NY has 60 observations, and
WA has 300 observations. If you specify the rate of stratified sampling to be 0.5, then
the sample obtained should have approximately 50, 30, and 150 observations of NJ, NY,
and WA respectively.

Here is an example query:

HiveQL

SET sampleRate=<sample rate, 0-1>;


select
field1, field2, field3, ..., fieldN, state
from
(
select
field1, field2, field3, ..., fieldN, state,
count(*) over (partition by state) as state_cnt,
rank() over (partition by state order by rand()) as state_rank
from <table name>
) a
where state_rank <= state_cnt*'${hiveconf:sampleRate}'

For information on more advanced sampling methods that are available in Hive, see
LanguageManual Sampling .

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect



Access datasets with Python using the
Azure Machine Learning Python client
library
Article • 01/06/2023

The preview of the Microsoft Azure Machine Learning Python client library enables
secure access to your Azure Machine Learning datasets from a local Python environment
and lets you create and manage datasets in a workspace.

This topic provides instructions on how to:

install the Machine Learning Python client library


access and upload datasets, including instructions on how to get authorization to
access Azure Machine Learning datasets from your local Python environment
access intermediate datasets from experiments
use the Python client library to enumerate datasets, access metadata, read the
contents of a dataset, create new datasets, and update existing datasets

Prerequisites
The Python client library has been tested under the following environments:

Windows, Mac, and Linux


Python 2.7 and 3.6+

It has a dependency on the following packages:

requests
python-dateutil
pandas

We recommend using a Python distribution such as Anaconda or Canopy , which


come with Python, IPython and the three packages listed above installed. Although
IPython is not strictly required, it is a great environment for manipulating and visualizing
data interactively.

How to install the Azure Machine Learning Python client library
Install the Azure Machine Learning Python client library to complete the tasks outlined
in this topic. This library is available from the Python Package Index . To install it in
your Python environment, run the following command from your local Python
environment:

Console

pip install azureml

Alternatively, you can download and install from the sources on GitHub .

Console

python setup.py install

If you have git installed on your machine, you can use pip to install directly from the git
repository:

Console

pip install git+https://github.com/Azure/Azure-MachineLearning-


ClientLibrary-Python.git

Use code snippets to access datasets


The Python client library gives you programmatic access to your existing datasets from
experiments that have been run.

From the Azure Machine Learning Studio (classic) web interface, you can generate code
snippets that include all the necessary information to download and deserialize datasets
as pandas DataFrame objects on your local machine.

Security for data access


The code snippets provided by Azure Machine Learning Studio (classic) for use with the
Python client library include your workspace ID and authorization token. These provide
full access to your workspace and must be protected, like a password.

For security reasons, the code snippet functionality is only available to users that have
their role set as Owner for the workspace. Your role is displayed in Azure Machine
Learning Studio (classic) on the USERS page under Settings.
If your role is not set as Owner, you can either request to be reinvited as an owner, or
ask the owner of the workspace to provide you with the code snippet.

To obtain the authorization token, you may choose one of these options:

Ask for a token from an owner. Owners can access their authorization tokens from
the Settings page of their workspace in Azure Machine Learning Studio (classic).
Select Settings from the left pane and click AUTHORIZATION TOKENS to see the
primary and secondary tokens. Although either the primary or the secondary
authorization tokens can be used in the code snippet, it is recommended that
owners only share the secondary authorization tokens.

Ask to be promoted to role of owner: a current owner of the workspace needs to


first remove you from the workspace then reinvite you to it as an owner.

Once developers have obtained the workspace ID and authorization token, they are able
to access the workspace using the code snippet regardless of their role.

Authorization tokens are managed on the AUTHORIZATION TOKENS page under


SETTINGS. You can regenerate them, but this procedure revokes access to the previous
token.

Access datasets from a local Python application


1. In Machine Learning Studio (classic), click DATASETS in the navigation bar on the
left.

2. Select the dataset you would like to access. You can select any of the datasets from
the MY DATASETS list or from the SAMPLES list.

3. From the bottom toolbar, click Generate Data Access Code. If the data is in a
format incompatible with the Python client library, this button is disabled.

4. Select the code snippet from the window that appears and copy it to your
clipboard.
5. Paste the code into the notebook of your local Python application.

Access intermediate datasets from Machine Learning experiments
After an experiment is run in Machine Learning Studio (classic), it is possible to access
the intermediate datasets from the output nodes of modules. Intermediate datasets are
data that has been created and used for intermediate steps when a model tool has been
run.

Intermediate datasets can be accessed as long as the data format is compatible with the
Python client library.

The following formats are supported (constants for these formats are in the
azureml.DataTypeIds class):

PlainText
GenericCSV
GenericTSV
GenericCSVNoHeader
GenericTSVNoHeader

You can determine the format by hovering over a module output node. It is displayed
along with the node name, in a tooltip.

Some of the modules, such as the Split module, output to a format named Dataset ,
which is not supported by the Python client library.

You need to use a conversion module, such as Convert to CSV, to get an output into a
supported format.

The following steps show an example that creates an experiment, runs it and accesses
the intermediate dataset.

1. Create a new experiment.

2. Insert an Adult Census Income Binary Classification dataset module.

3. Insert a Split module, and connect its input to the dataset module output.
4. Insert a Convert to CSV module and connect its input to one of the Split module
outputs.

5. Save the experiment, run it, and wait for the job to finish.

6. Click the output node on the Convert to CSV module.

7. When the context menu appears, select Generate Data Access Code.

8. Select the code snippet and copy it to your clipboard from the window that
appears.
9. Paste the code in your notebook.

10. You can visualize the data using matplotlib, for example as a histogram of the age
column, as shown in the sketch below.
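
For example, assuming the generated code snippet produced a pandas DataFrame named frame
(a hypothetical name) and that the dataset contains an age column, a minimal sketch is:

Python

import matplotlib.pyplot as plt

frame['age'].hist(bins=20)   # 'frame' comes from the generated data access code
plt.xlabel('age')
plt.ylabel('count')
plt.show()
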
Use the Machine Learning Python client library
to access, read, create, and manage datasets

Workspace
The workspace is the entry point for the Python client library. Provide the Workspace
class with your workspace ID and authorization token to create an instance:

Python

from azureml import Workspace

ws = Workspace(workspace_id='4c29e1adeba2e5a7cbeb0e4f4adfb4df',
               authorization_token='f4f3ade2c6aefdb1afb043cd8bcf3daf')

Enumerate datasets
To enumerate all datasets in a given workspace:

Python

for ds in ws.datasets:
print(ds.name)

To enumerate just the user-created datasets:

Python

for ds in ws.user_datasets:
print(ds.name)
To enumerate just the example datasets:

Python

for ds in ws.example_datasets:
print(ds.name)

You can access a dataset by name (which is case-sensitive):

Python

ds = ws.datasets['my dataset name']

Or you can access it by index:

Python

ds = ws.datasets[0]

Metadata
Datasets have metadata, in addition to content. (Intermediate datasets are an exception
to this rule and do not have any metadata.)

Some metadata values are assigned by the user at creation time:

print(ds.name)
print(ds.description)

print(ds.family_id)
print(ds.data_type_id)

Others are values assigned by Azure ML:

print(ds.id)
print(ds.created_date)

print(ds.size)

See the SourceDataset class for more on the available metadata.

Read contents
The code snippets provided by Machine Learning Studio (classic) automatically
download and deserialize the dataset to a pandas DataFrame object. This is done with
the to_dataframe method:

Python

frame = ds.to_dataframe()

If you prefer to download the raw data, and perform the deserialization yourself, that is
an option. At the moment, this is the only option for formats such as 'ARFF', which the
Python client library cannot deserialize.

To read the contents as text:

Python

text_data = ds.read_as_text()

To read the contents as binary:

Python

binary_data = ds.read_as_binary()

You can also just open a stream to the contents:

Python

with ds.open() as file:
    binary_data_chunk = file.read(1000)

Create a new dataset


The Python client library allows you to upload datasets from your Python program.
These datasets are then available for use in your workspace.

If you have your data in a pandas DataFrame, use the following code:

Python

from azureml import DataTypeIds

dataset = ws.datasets.add_from_dataframe(
dataframe=frame,
data_type_id=DataTypeIds.GenericCSV,
name='my new dataset',
description='my description'
)

If your data is already serialized, you can use:

Python

from azureml import DataTypeIds

dataset = ws.datasets.add_from_raw_data(
raw_data=raw_data,
data_type_id=DataTypeIds.GenericCSV,
name='my new dataset',
description='my description'
)

The Python client library is able to serialize a pandas DataFrame to the following formats
(constants for these are in the azureml.DataTypeIds class):

PlainText
GenericCSV
GenericTSV
GenericCSVNoHeader
GenericTSVNoHeader

Update an existing dataset


If you try to upload a new dataset with a name that matches an existing dataset, you
should get a conflict error.

To update an existing dataset, you first need to get a reference to the existing dataset:

Python

dataset = ws.datasets['existing dataset']

print(dataset.data_type_id) # 'GenericCSV'
print(dataset.name) # 'existing dataset'
print(dataset.description) # 'data up to jan 2015'

Then use update_from_dataframe to serialize and replace the contents of the dataset on
Azure:

Python
dataset = ws.datasets['existing dataset']

dataset.update_from_dataframe(frame2)

print(dataset.data_type_id) # 'GenericCSV'
print(dataset.name) # 'existing dataset'
print(dataset.description) # 'data up to jan 2015'

If you want to serialize the data to a different format, specify a value for the optional
data_type_id parameter.

Python

from azureml import DataTypeIds

dataset = ws.datasets['existing dataset']

dataset.update_from_dataframe(
dataframe=frame2,
data_type_id=DataTypeIds.GenericTSV,
)

print(dataset.data_type_id) # 'GenericTSV'
print(dataset.name) # 'existing dataset'
print(dataset.description) # 'data up to jan 2015'

You can optionally set a new description by specifying a value for the description
parameter.

Python

dataset = ws.datasets['existing dataset']

dataset.update_from_dataframe(
dataframe=frame2,
description='data up to feb 2015',
)

print(dataset.data_type_id) # 'GenericCSV'
print(dataset.name) # 'existing dataset'
print(dataset.description) # 'data up to feb 2015'

You can optionally set a new name by specifying a value for the name parameter. From
now on, you'll retrieve the dataset using the new name only. The following code updates
the data, name, and description.

Python
dataset = ws.datasets['existing dataset']

dataset.update_from_dataframe(
dataframe=frame2,
name='existing dataset v2',
description='data up to feb 2015',
)

print(dataset.data_type_id) # 'GenericCSV'
print(dataset.name) # 'existing dataset v2'
print(dataset.description) # 'data up to feb 2015'

print(ws.datasets['existing dataset v2'].name) # 'existing dataset v2'


print(ws.datasets['existing dataset'].name) # IndexError

The data_type_id , name and description parameters are optional and default to their
previous value. The dataframe parameter is always required.

If your data is already serialized, use update_from_raw_data instead of


update_from_dataframe . If you just pass in raw_data instead of dataframe , it works in a
similar way.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect


Next steps
Explore and analyze data with Python
Azure ML Package client library for Python - version 1.2.0
Data collection and manipulation

Related resources
Explore data in the Team Data Science Process
Sample data in Azure blob containers, SQL Server, and Hive tables
Create features for data in SQL Server using SQL and Python
What is the Team Data Science Process?
Process Azure Blob Storage data with
advanced analytics
Article • 11/21/2022

This document covers exploring data and generating features from data stored in Azure
Blob Storage.

Load the data into a Pandas data frame


In order to explore and manipulate a dataset, it must be downloaded from the blob
source to a local file that can then be loaded in a Pandas data frame. Here are the steps
to follow for this procedure:

1. Download the data from Blob Storage with the following sample Python code
using the Blob service. Replace the variables in the code below with your specific values:

Python

from azure.storage.blob import BlobService
import tables
import time

STORAGEACCOUNTNAME= <storage_account_name>
STORAGEACCOUNTKEY= <storage_account_key>
LOCALFILENAME= <local_file_name>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

#download from blob
t1=time.time()
blob_service=BlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTKEY)
blob_service.get_blob_to_path(CONTAINERNAME,BLOBNAME,LOCALFILENAME)
t2=time.time()
print(("It takes %s seconds to download "+BLOBNAME) % (t2 - t1))

2. Read the data into a Pandas data-frame from the downloaded file.

Python

import pandas as pd

#LOCALFILENAME is the file path
dataframe_blobdata = pd.read_csv(LOCALFILENAME)

Now you are ready to explore the data and generate features on this dataset.
Data Exploration
Here are a few examples of ways to explore data using Pandas:

1. Inspect the number of rows and columns:

Python

print('the size of the data is: %d rows and %d columns' % dataframe_blobdata.shape)

2. Inspect the first or last few rows in the dataset as below:

Python

dataframe_blobdata.head(10)

dataframe_blobdata.tail(10)

3. Check the data type each column was imported as using the following sample
code

Python

for col in dataframe_blobdata.columns:
    print(dataframe_blobdata[col].name, ':\t', dataframe_blobdata[col].dtype)

4. Check the basic stats for the columns in the data set as follows

Python

dataframe_blobdata.describe()

5. Look at the number of entries for each column value as follows

Python

dataframe_blobdata['<column_name>'].value_counts()

6. Count missing values versus the actual number of entries in each column using the
following sample code

Python
miss_num = dataframe_blobdata.shape[0] - dataframe_blobdata.count()
print(miss_num)

7. If you have missing values for a specific column in the data, you can drop them as
follows:

Python

dataframe_blobdata_noNA = dataframe_blobdata.dropna()
dataframe_blobdata_noNA.shape

Another way to replace missing values is with the mode function:

Python

dataframe_blobdata_mode = dataframe_blobdata.fillna({'<column_name>':dataframe_blobdata['<column_name>'].mode()[0]})

8. Create a histogram plot using variable number of bins to plot the distribution of a
variable:

Python

import numpy as np

dataframe_blobdata['<column_name>'].value_counts().plot(kind='bar')

np.log(dataframe_blobdata['<column_name>']+1).hist(bins=50)

9. Look at correlations between variables using a scatterplot or using the built-in


correlation function:

Python

import matplotlib.pyplot as plt

#relationship between column_a and column_b using scatter plot
plt.scatter(dataframe_blobdata['<column_a>'], dataframe_blobdata['<column_b>'])

#correlation between column_a and column_b
dataframe_blobdata[['<column_a>', '<column_b>']].corr()

Feature Generation
We can generate features using Python as follows:
Indicator value-based Feature Generation
Categorical features can be created as follows:

1. Inspect the distribution of the categorical column:

Python

dataframe_blobdata['<categorical_column>'].value_counts()

2. Generate indicator values for each of the column values:

Python

#generate the indicator column


dataframe_blobdata_identity = pd.get_dummies(dataframe_blobdata['<categorical_column>'], prefix='<categorical_column>_identity')

3. Join the indicator column with the original data frame:

Python

#Join the dummy variables back to the original data frame


dataframe_blobdata_with_identity = dataframe_blobdata.join(dataframe_blobdata_identity)

4. Remove the original variable itself:

Python

#Remove the original categorical column
dataframe_blobdata_with_identity.drop('<categorical_column>', axis=1, inplace=True)

Binning Feature Generation


For generating binned features, we proceed as follows:

1. Add a sequence of columns to bin a numeric column:

Python

bins = [0, 1, 2, 4, 10, 40]


dataframe_blobdata_bin_id = pd.cut(dataframe_blobdata['<numeric_column>'], bins)

2. Convert binning to a sequence of boolean variables

Python

dataframe_blobdata_bin_bool = pd.get_dummies(dataframe_blobdata_bin_id,
prefix='<numeric_column>')

3. Finally, Join the dummy variables back to the original data frame

Python

dataframe_blobdata_with_bin_bool = dataframe_blobdata.join(dataframe_blobdata_bin_bool)

Writing data back to Blob Storage and consuming in Azure Machine Learning

After you have explored the data and created the necessary features, you can upload
the data (sampled or featurized) to Blob Storage and consume it in Azure Machine
Learning using the following steps. Additional features can be created in Azure
Machine Learning studio (classic) as well.

1. Write the data frame to local file

Python

dataframe.to_csv(os.path.join(os.getcwd(),LOCALFILENAME), sep='\t',
encoding='utf-8', index=False)

2. Upload the data to Blob Storage as follows:

Python

from azure.storage.blob import BlobService
import tables
import os

STORAGEACCOUNTNAME= <storage_account_name>
LOCALFILENAME= <local_file_name>
STORAGEACCOUNTKEY= <storage_account_key>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

output_blob_service=BlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTKEY)
localfileprocessed = os.path.join(os.getcwd(),LOCALFILENAME) #assuming file is in current working directory

try:
    #perform upload
    output_blob_service.put_block_blob_from_path(CONTAINERNAME,BLOBNAME,localfileprocessed)
except:
    print ("Something went wrong with uploading blob:"+BLOBNAME)

3. Now the data can be read from the blob by using the Azure Machine Learning Import
Data module.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect



Scalable data science with Azure Data
Lake
Article • 05/30/2023

This walkthrough shows how to use Azure Data Lake to do data exploration and binary
classification tasks on a sample of the NYC taxi trip and fare dataset. The sample shows
you how to predict whether or not a tip is paid by a fare. It walks you through the steps
of the Team Data Science Process, end-to-end, from data acquisition to model training.
Then it shows you how to deploy a web service that publishes the model.

Technologies
These technologies are used in this walkthrough.

Azure Data Lake Analytics


U-SQL and Visual Studio
Python
Azure Machine Learning
Scripts

Azure Data Lake Analytics


The Microsoft Azure Data Lake has all the capabilities required to make it easy for
data scientists to store data of any size, shape and speed, and to conduct data
processing, advanced analytics, and machine learning modeling with high scalability in a
cost-effective way. You pay on a per-job basis, only when data is actually being
processed. Azure Data Lake Analytics includes U-SQL, a language that blends the
declarative nature of SQL with the expressive power of C#. U-SQL then provides a
scalable distributed query capability. It enables you to process unstructured data by
applying schema on read. You can also insert custom logic and user-defined functions
(UDFs), and it includes extensibility to enable fine-grained control over how to execute
at scale. To learn more about the design philosophy behind U-SQL, see this Visual Studio
blog post .

Data Lake Analytics is also a key part of Cortana Analytics Suite. It works with Azure
Synapse Analytics, Power BI, and Data Factory. This combination gives you a complete
cloud big data and advanced analytics platform.
This walkthrough begins by describing how to install the prerequisites and resources
that you need to complete the data science process tasks. Then it outlines the data
processing steps using U-SQL and concludes by showing how to use Python and Hive
with Azure Machine Learning studio (classic) to build and deploy the predictive models.

U-SQL and Visual Studio


This walkthrough recommends using Visual Studio to edit U-SQL scripts to process the
dataset. The U-SQL scripts are described here and provided in a separate file. The
process includes ingesting, exploring, and sampling the data. It also shows how to run a
U-SQL scripted job from the Azure portal. Hive tables are created for the data in an
associated HDInsight cluster to facilitate the building and deployment of a binary
classification model in Azure Machine Learning studio.

Python
This walkthrough also contains a section that shows how to build and deploy a
predictive model using Python with Azure Machine Learning Studio. It provides a
Jupyter Notebook with the Python scripts for the steps in this process. The notebook
includes code for some additional feature engineering steps and model construction,
such as multiclass classification and regression modeling, in addition to the binary
classification model outlined here. The regression task is to predict the amount of the
tip based on other tip features.

Azure Machine Learning


Azure Machine Learning studio (classic) is used to build and deploy the predictive
models using two approaches: first with Python scripts and then with Hive tables on an
HDInsight (Hadoop) cluster.

Scripts
Only the principal steps are outlined in this walkthrough. You can download the full U-
SQL script and Jupyter Notebook from GitHub .

Prerequisites
Before you begin these topics, you must have the following:

An Azure subscription. If you don't already have one, see Get Azure free trial .
[Recommended] Visual Studio 2013 or later. If you don't already have one of these
versions installed, you can download a free Community version from Visual Studio
Community .

Note

Instead of Visual Studio, you can also use the Azure portal to submit Azure Data
Lake queries. Instructions are provided on how to do so both with Visual Studio
and on the portal in the section titled Process data with U-SQL.

Prepare data science environment for Azure Data Lake
To prepare the data science environment for this walkthrough, create the following
resources:

Azure Data Lake Storage (ADLS)


Azure Data Lake Analytics (ADLA)
Azure Blob storage account
Azure Machine Learning studio (classic) account
Azure Data Lake Tools for Visual Studio (Recommended)

This section provides instructions on how to create each of these resources. If you
choose to use Hive tables with Azure Machine Learning, instead of Python, to build a
model, you also need to provision an HDInsight (Hadoop) cluster. This alternative
procedure is described in the Option 2 section.

Note

The Azure Data Lake Store can be created either separately or when you create the
Azure Data Lake Analytics as the default storage. Instructions are referenced for
creating each of these resources separately, but the Data Lake storage account
need not be created separately.

Create an Azure Data Lake Storage


Create an ADLS from the Azure portal . For details, see Create an HDInsight cluster
with Data Lake Store using Azure portal. Be sure to set up the Cluster AAD Identity in
the DataSource blade of the Optional Configuration blade described there.
Create an Azure Data Lake Analytics account
Create an ADLA account from the Azure portal . For details, see Tutorial: get started
with Azure Data Lake Analytics using Azure portal.
Create an Azure Blob storage account
Create an Azure Blob storage account from the Azure portal . For details, see the
Create a storage account section in About Azure Storage accounts.

Set up an Azure Machine Learning studio (classic) account

Sign up/into Azure Machine Learning studio (classic) from the Azure Machine Learning
studio page. Click on the Get started now button and then choose a "Free
Workspace" or "Standard Workspace". Now you are ready to create experiments in
Azure Machine Learning studio.

Install Azure Data Lake Tools [Recommended]


Install Azure Data Lake Tools for your version of Visual Studio from Azure Data Lake
Tools for Visual Studio .
After the installation finishes, open up Visual Studio. You should see the Data Lake tab
in the menu at the top. Your Azure resources should appear in the left panel when you sign
into your Azure account.
The NYC Taxi Trips dataset
The data set used here is a publicly available dataset -- the NYC Taxi Trips dataset . The
NYC Taxi Trip data consists of about 20 GB of compressed CSV files (~48 GB
uncompressed), recording more than 173 million individual trips and the fares paid for
each trip. Each trip record includes the pickup and dropoff locations and times,
anonymized hack (driver's) license number, and the medallion (taxi's unique ID) number.
The data covers all trips in the year 2013 and is provided in the following two datasets
for each month:

The 'trip_data' CSV contains trip details, such as number of passengers, pickup and
dropoff points, trip duration, and trip length. Here are a few sample records:

medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude

89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-06 00:18:35,2013-01-06 00:22:54,1,259,1.50,-74.006683,40.731781,-73.994499,40.75066
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-05 18:49:41,2013-01-05 18:54:23,1,282,1.10,-74.004707,40.73777,-74.009834,40.726002
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:54:15,2013-01-07 23:58:20,2,244,.70,-73.974602,40.759945,-73.984734,40.759388
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:25:03,2013-01-07 23:34:24,1,560,2.10,-73.97625,40.748528,-74.002586,40.747868

The 'trip_fare' CSV contains details of the fare paid for each trip, such as payment type,
fare amount, surcharge and taxes, tips and tolls, and the total amount paid. Here are a
few sample records:

medallion, hack_license, vendor_id, pickup_datetime, payment_type, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount

89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,2013-01-01 15:11:48,CSH,6.5,0,0.5,0,0,7
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-06 00:18:35,CSH,6,0.5,0.5,0,0,7
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-05 18:49:41,CSH,5.5,1,0.5,0,0,7
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07 23:54:15,CSH,5,0.5,0.5,0,0,6
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07 23:25:03,CSH,9.5,0.5,0.5,0,0,10.5

The unique key to join trip_data and trip_fare is composed of the following three fields:
medallion, hack_license and pickup_datetime. The raw CSV files can be accessed from an
Azure Storage blob. The U-SQL script for this join is in the Join trip and fare tables
section.

Process data with U-SQL


The data processing tasks illustrated in this section include ingesting, checking quality,
exploring, and sampling the data. How to join the trip and fare tables is also shown. The
final section shows how to run a U-SQL scripted job from the Azure portal. Here are links to
each subsection:

Data ingestion: read in data from public blob


Data quality checks
Data exploration
Join trip and fare tables
Data sampling
Run U-SQL jobs

The U-SQL scripts are described here and provided in a separate file. You can download
the full U-SQL scripts from GitHub .

To execute U-SQL, open Visual Studio, click File --> New --> Project, choose U-SQL
Project, and then name and save it to a folder.
7 Note

It's possible to use the Azure Portal to execute U-SQL instead of Visual Studio. You
can navigate to the Azure Data Lake Analytics resource on the portal and submit
queries directly as illustrated in the following figure:

Data Ingestion: Read in data from public blob


The location of the data in the Azure blob is referenced as
wasb://container_name@blob_storage_account_name.blob.core.windows.net/blob_na
me and can be extracted using Extractors.Csv(). Substitute your own container name
and storage account name in following scripts for
container_name@blob_storage_account_name in the wasb address. Since the file names
are in same format, it's possible to use trip_data_{*}.csv to read in all 12 trip files.

SQL

///Read in Trip data


@trip0 =
EXTRACT
medallion string,
hack_license string,
vendor_id string,
rate_code string,
store_and_fwd_flag string,
pickup_datetime string,
dropoff_datetime string,
passenger_count string,
trip_time_in_secs string,
trip_distance string,
pickup_longitude string,
pickup_latitude string,
dropoff_longitude string,
dropoff_latitude string
// This is reading 12 trip data from blob
FROM
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/nycta
xitrip/trip_data_{*}.csv"
USING Extractors.Csv();

Since there are headers in the first row, you need to remove the headers and change the
column types into appropriate ones. You can either save the processed data to Azure
Data Lake Storage using
swebhdfs://data_lake_storage_name.azuredatalakestore.net/folder_name/file_name
or to an Azure Blob storage account using
wasb://container_name@blob_storage_account_name.blob.core.windows.net/blob_na
me.

SQL

// change data types


@trip =
SELECT
medallion,
hack_license,
vendor_id,
rate_code,
store_and_fwd_flag,
DateTime.Parse(pickup_datetime) AS pickup_datetime,
DateTime.Parse(dropoff_datetime) AS dropoff_datetime,
Int32.Parse(passenger_count) AS passenger_count,
Double.Parse(trip_time_in_secs) AS trip_time_in_secs,
Double.Parse(trip_distance) AS trip_distance,
(pickup_longitude==string.Empty ? 0: float.Parse(pickup_longitude)) AS
pickup_longitude,
(pickup_latitude==string.Empty ? 0: float.Parse(pickup_latitude)) AS
pickup_latitude,
(dropoff_longitude==string.Empty ? 0: float.Parse(dropoff_longitude)) AS
dropoff_longitude,
(dropoff_latitude==string.Empty ? 0: float.Parse(dropoff_latitude)) AS
dropoff_latitude
FROM @trip0
WHERE medallion != "medallion";

////output data to ADL


OUTPUT @trip
TO
"swebhdfs://data_lake_storage_name.azuredatalakestore.net/nyctaxi_folder/dem
o_trip.csv"
USING Outputters.Csv();

////Output data to blob


OUTPUT @trip
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
trip.csv"
USING Outputters.Csv();

Similarly, you can read in the fare data sets. Right-click Azure Data Lake Storage and
choose to look at your data in the Azure portal (Data Explorer) or in File Explorer within
Visual Studio.
Data quality checks
After trip and fare tables have been read in, data quality checks can be done in the
following way. The resulting CSV files can be output to Azure Blob storage or Azure
Data Lake Storage.

Find the number of medallions and unique number of medallions:

SQL

///check the number of medallions and unique number of medallions


@trip2 =
SELECT
medallion,
vendor_id,
pickup_datetime.Month AS pickup_month
FROM @trip;

@ex_1 =
SELECT
pickup_month,
COUNT(medallion) AS cnt_medallion,
COUNT(DISTINCT(medallion)) AS unique_medallion
FROM @trip2
GROUP BY pickup_month;
OUTPUT @ex_1
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_1.csv"
USING Outputters.Csv();

Find those medallions that had more than 100 trips:

SQL

///find those medallions that had more than 100 trips


@ex_2 =
SELECT medallion,
COUNT(medallion) AS cnt_medallion
FROM @trip2
//where pickup_datetime >= "2013-01-01t00:00:00.0000000" and
pickup_datetime <= "2013-04-01t00:00:00.0000000"
GROUP BY medallion
HAVING COUNT(medallion) > 100;
OUTPUT @ex_2
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_2.csv"
USING Outputters.Csv();

Find those invalid records in terms of pickup_longitude:

SQL
///find those invalid records in terms of pickup_longitude
@ex_3 =
SELECT COUNT(medallion) AS cnt_invalid_pickup_longitude
FROM @trip
WHERE
pickup_longitude < -90 OR pickup_longitude > 90;
OUTPUT @ex_3
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_3.csv"
USING Outputters.Csv();

Find missing values for some variables:

SQL

//check missing values


@res =
SELECT *,
(medallion == null? 1 : 0) AS missing_medallion
FROM @trip;

@trip_summary6 =
SELECT
vendor_id,
SUM(missing_medallion) AS medallion_empty,
COUNT(medallion) AS medallion_total,
COUNT(DISTINCT(medallion)) AS medallion_total_unique
FROM @res
GROUP BY vendor_id;
OUTPUT @trip_summary6
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_16.csv"
USING Outputters.Csv();

Data exploration
Do some data exploration with the following scripts to get a better understanding of the
data.

Find the distribution of tipped and non-tipped trips:

SQL

///tipped vs. not tipped distribution


@tip_or_not =
SELECT *,
(tip_amount > 0 ? 1: 0) AS tipped
FROM @fare;

@ex_4 =
SELECT tipped,
COUNT(*) AS tip_freq
FROM @tip_or_not
GROUP BY tipped;
OUTPUT @ex_4
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_4.csv"
USING Outputters.Csv();

Find the distribution of tip amount with cut-off values: 0, 5, 10, and 20 dollars.

SQL

//tip class/range distribution


@tip_class =
SELECT *,
(tip_amount >20? 4: (tip_amount >10? 3:(tip_amount >5 ? 2:
(tip_amount > 0 ? 1: 0)))) AS tip_class
FROM @fare;
@ex_5 =
SELECT tip_class,
COUNT(*) AS tip_freq
FROM @tip_class
GROUP BY tip_class;
OUTPUT @ex_5
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_5.csv"
USING Outputters.Csv();

Find basic statistics of trip distance:

SQL

// find basic statistics for trip_distance


@trip_summary4 =
SELECT
vendor_id,
COUNT(*) AS cnt_row,
MIN(trip_distance) AS min_trip_distance,
MAX(trip_distance) AS max_trip_distance,
AVG(trip_distance) AS avg_trip_distance
FROM @trip
GROUP BY vendor_id;
OUTPUT @trip_summary4
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_14.csv"
USING Outputters.Csv();

Find the percentiles of trip distance:

SQL

// find percentiles of trip_distance


@trip_summary3 =
SELECT DISTINCT vendor_id AS vendor,
PERCENTILE_DISC(0.25) WITHIN GROUP(ORDER BY
trip_distance) OVER(PARTITION BY vendor_id) AS trip_distance_25th_percentile,
PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY
trip_distance) OVER(PARTITION BY vendor_id) AS trip_distance_median,
PERCENTILE_DISC(0.75) WITHIN GROUP(ORDER BY
trip_distance) OVER(PARTITION BY vendor_id) AS trip_distance_75th_percentile
FROM @trip;
// group by vendor_id;
OUTPUT @trip_summary3
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_13.csv"
USING Outputters.Csv();

Join trip and fare tables


Trip and fare tables can be joined by medallion, hack_license, and pickup_datetime.

SQL

//join trip and fare table

@model_data_full =
SELECT t.*,
f.payment_type, f.fare_amount, f.surcharge, f.mta_tax, f.tolls_amount,
f.total_amount, f.tip_amount,
(f.tip_amount > 0 ? 1: 0) AS tipped,
(f.tip_amount >20? 4: (f.tip_amount >10? 3:(f.tip_amount >5 ? 2:
(f.tip_amount > 0 ? 1: 0)))) AS tip_class
FROM @trip AS t JOIN @fare AS f
ON (t.medallion == f.medallion AND t.hack_license == f.hack_license AND
t.pickup_datetime == f.pickup_datetime)
WHERE (pickup_longitude != 0 AND dropoff_longitude != 0 );

//// output to blob


OUTPUT @model_data_full
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_7_full_data.csv"
USING Outputters.Csv();
////output data to ADL
OUTPUT @model_data_full
TO
"swebhdfs://data_lake_storage_name.azuredatalakestore.net/nyctaxi_folder/dem
o_ex_7_full_data.csv"
USING Outputters.Csv();

For each level of passenger count, calculate the number of records, average tip amount,
variance of tip amount, and percentage of tipped trips.

SQL

// contingency table
@trip_summary8 =
SELECT passenger_count,
COUNT(*) AS cnt,
AVG(tip_amount) AS avg_tip_amount,
VAR(tip_amount) AS var_tip_amount,
SUM(tipped) AS cnt_tipped,
(float)SUM(tipped)/COUNT(*) AS pct_tipped
FROM @model_data_full
GROUP BY passenger_count;
OUTPUT @trip_summary8
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_17.csv"
USING Outputters.Csv();

Data sampling
First, randomly select 0.1% of the data from the joined table:

SQL

//random select 1/1000 data for modeling purpose


@addrownumberres_randomsample =
SELECT *,
ROW_NUMBER() OVER() AS rownum
FROM @model_data_full;

@model_data_random_sample_1_1000 =
SELECT *
FROM @addrownumberres_randomsample
WHERE rownum % 1000 == 0;

OUTPUT @model_data_random_sample_1_1000
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_7_random_1_1000.csv"
USING Outputters.Csv();

Then do stratified sampling by the variable tip_class:

SQL

//stratified random select 1/1000 data for modeling purpose


@addrownumberres_stratifiedsample =
SELECT *,
ROW_NUMBER() OVER(PARTITION BY tip_class) AS rownum
FROM @model_data_full;

@model_data_stratified_sample_1_1000 =
SELECT *
FROM @addrownumberres_stratifiedsample
WHERE rownum % 1000 == 0;
//// output to blob
OUTPUT @model_data_stratified_sample_1_1000
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_9_stratified_1_1000.csv"
USING Outputters.Csv();
////output data to ADL
OUTPUT @model_data_stratified_sample_1_1000
TO
"swebhdfs://data_lake_storage_name.azuredatalakestore.net/nyctaxi_folder/dem
o_ex_9_stratified_1_1000.csv"
USING Outputters.Csv();
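
If you prefer to do this sampling in Python later in the workflow, the same idea can be
sketched with pandas. The snippet below is only an illustration under assumed names: it
supposes the joined trip/fare data is already in a pandas data frame called df with a
tip_class column, and it keeps roughly 1/1000 of the rows from each tip_class stratum.

Python

# Illustrative pandas version of the stratified 1/1000 sample (df and tip_class are assumed).
import pandas as pd

def stratified_sample(df, strata_col='tip_class', fraction=0.001, seed=0):
    # Sample the same fraction from every stratum so the class mix is preserved.
    return (df.groupby(strata_col, group_keys=False)
              .apply(lambda g: g.sample(frac=fraction, random_state=seed)))

# Example usage (output file name is only a placeholder):
# sample_df = stratified_sample(df)
# sample_df.to_csv('demo_ex_9_stratified_local.csv', index=False)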

Run U-SQL jobs


After editing the U-SQL scripts, you can submit them to the server using your Azure Data
Lake Analytics account. Click Data Lake, Submit Job, select your Analytics Account,
choose Parallelism, and click the Submit button.
When the job is compiled successfully, the status of your job is displayed in Visual
Studio for monitoring. After the job completes, you can even replay the job execution
process and find the bottleneck steps to improve your job efficiency. You can also
go to the Azure portal to check the status of your U-SQL jobs.
Now you can check the output files in either Azure Blob storage or the Azure portal. Use the
stratified sample data for the modeling in the next step.

Build and deploy models in Azure Machine Learning


Two options are available for you to pull data into Azure Machine Learning to build and
deploy models:

In the first option, you use the sampled data that has been written to an Azure
Blob (in the Data sampling step above) and use Python to build and deploy
models from Azure Machine Learning.
In the second option, you query the data in Azure Data Lake directly using a Hive
query. This option requires that you create a new HDInsight cluster or use an
existing HDInsight cluster where the Hive tables point to the NY Taxi data in Azure
Data Lake Storage. Both these options are discussed in the following sections.

Option 1: Use Python to build and deploy machine learning models


To build and deploy machine learning models using Python, create a Jupyter Notebook
on your local machine or in Azure Machine Learning studio. The Jupyter Notebook
provided on GitHub contains the full code for data exploration, visualization, feature
engineering, modeling, and deployment. In this article, only the modeling and
deployment steps are covered.

Import Python libraries


In order to run the sample Jupyter Notebook or the Python script file, you need the
following Python packages. If you're using the Azure Machine Learning Notebook
service, these packages have been pre-installed.

Python

from __future__ import division  # __future__ imports must appear before any other statement

import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import matplotlib.pyplot as plt
from time import time
import pyodbc
import os
from azure.storage.blob import BlobService
import tables
import time
import zipfile
import random
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from sklearn import linear_model
from azureml import services

Read in the data from blob


Connection String

Python

CONTAINERNAME = 'test1'
STORAGEACCOUNTNAME = 'XXXXXXXXX'
STORAGEACCOUNTKEY = 'YYYYYYYYYYYYYYYYYYYYYYYYYYYY'
BLOBNAME = 'demo_ex_9_stratified_1_1000_copy.csv'
blob_service =
BlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTK
EY)

Read in as text

Python

t1 = time.time()
data =
blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).split("\n")
t2 = time.time()
print(("It takes %s seconds to read in "+BLOBNAME) % (t2 - t1))

Add column names and separate columns

Python

colnames =
['medallion','hack_license','vendor_id','rate_code','store_and_fwd_flag
','pickup_datetime','dropoff_datetime',
'passenger_count','trip_time_in_secs','trip_distance','pickup_longitude
','pickup_latitude','dropoff_longitude','dropoff_latitude',
'payment_type', 'fare_amount', 'surcharge', 'mta_tax', 'tolls_amount',
'total_amount', 'tip_amount', 'tipped', 'tip_class', 'rownum']
df1 = pd.DataFrame([sub.split(",") for sub in data], columns =
colnames)
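
One detail to watch for: if the blob ends with a newline, split("\n") produces a trailing
empty string, which becomes an empty row in the data frame. One way to guard against
this, reusing the data and colnames variables above, is sketched here.

Python

# Drop empty lines (for example, a trailing one) before building the data frame.
data = [line for line in data if line.strip()]
df1 = pd.DataFrame([sub.split(",") for sub in data], columns=colnames)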

Change some columns to numeric

Python

cols_2_float =
['trip_time_in_secs','pickup_longitude','pickup_latitude','dropoff_long
itude','dropoff_latitude',
'fare_amount',
'surcharge','mta_tax','tolls_amount','total_amount','tip_amount',
'passenger_count','trip_distance'
,'tipped','tip_class','rownum']
for col in cols_2_float:
df1[col] = df1[col].astype(float)

Build machine learning models


Here you build a binary classification model to predict whether a trip is tipped or not. In
the Jupyter Notebook, you can find the other two models: multiclass classification and
regression.

First, you need to create dummy variables that can be used in scikit-learn models:

Python

df1_payment_type_dummy = pd.get_dummies(df1['payment_type'],
prefix='payment_type_dummy')
df1_vendor_id_dummy = pd.get_dummies(df1['vendor_id'],
prefix='vendor_id_dummy')

Create data frame for the modeling

Python

cols_to_keep = ['tipped', 'trip_distance', 'passenger_count']


data =
df1[cols_to_keep].join([df1_payment_type_dummy,df1_vendor_id_dummy])

X = data.iloc[:,1:]
Y = data.tipped

Training and testing 60-40 split

Python

X_train, X_test, Y_train, Y_test = train_test_split(X, Y,


test_size=0.4, random_state=0)

Logistic Regression in training set

Python

model = LogisticRegression()
logit_fit = model.fit(X_train, Y_train)
print ('Coefficients: \n', logit_fit.coef_)
Y_train_pred = logit_fit.predict(X_train)

Score testing data set

Python

Y_test_pred = logit_fit.predict(X_test)

Calculate Evaluation metrics

Python

fpr_train, tpr_train, thresholds_train = metrics.roc_curve(Y_train, Y_train_pred)
print(fpr_train, tpr_train, thresholds_train)

fpr_test, tpr_test, thresholds_test = metrics.roc_curve(Y_test, Y_test_pred)
print(fpr_test, tpr_test, thresholds_test)

# AUC
print(metrics.auc(fpr_train, tpr_train))
print(metrics.auc(fpr_test, tpr_test))

# Confusion matrix
print(metrics.confusion_matrix(Y_train, Y_train_pred))
print(metrics.confusion_matrix(Y_test, Y_test_pred))
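
The ROC curve above is computed from hard 0/1 predictions, which yields only a few
threshold points. For a smoother curve and a more informative AUC, you can score with
predicted class probabilities instead; the following sketch reuses the logit_fit model and
the train/test splits defined above.

Python

# Use class-1 probabilities rather than hard labels for the ROC curve.
Y_train_prob = logit_fit.predict_proba(X_train)[:, 1]
Y_test_prob = logit_fit.predict_proba(X_test)[:, 1]

fpr_train, tpr_train, thresholds_train = metrics.roc_curve(Y_train, Y_train_prob)
fpr_test, tpr_test, thresholds_test = metrics.roc_curve(Y_test, Y_test_prob)

print(metrics.auc(fpr_train, tpr_train))
print(metrics.auc(fpr_test, tpr_test))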

Build Web Service API and consume it in Python


After the machine learning model has been built, you want to operationalize it. The
binary logistic model is used here as an example. Make sure the scikit-learn version on
your local machine is 0.15.1 (Azure Machine Learning studio is already at least at this
version).

Find your workspace credentials from Azure Machine Learning studio (classic)
settings. In Azure Machine Learning studio, click Settings --> Name -->
Authorization Tokens.
Output

workspaceid = 'xxxxxxxxxxxxxxxxxxxxxxxxxxx'
auth_token = 'xxxxxxxxxxxxxxxxxxxxxxxxxxx'

Create Web Service

Python

@services.publish(workspaceid, auth_token)
@services.types(trip_distance = float, passenger_count = float,
payment_type_dummy_CRD = float, payment_type_dummy_CSH=float,
payment_type_dummy_DIS = float, payment_type_dummy_NOC = float,
payment_type_dummy_UNK = float, vendor_id_dummy_CMT = float,
vendor_id_dummy_VTS = float)
@services.returns(int) #0, or 1
def predictNYCTAXI(trip_distance, passenger_count,
payment_type_dummy_CRD, payment_type_dummy_CSH,payment_type_dummy_DIS,
payment_type_dummy_NOC, payment_type_dummy_UNK, vendor_id_dummy_CMT,
vendor_id_dummy_VTS ):
inputArray = [trip_distance, passenger_count,
payment_type_dummy_CRD, payment_type_dummy_CSH, payment_type_dummy_DIS,
payment_type_dummy_NOC, payment_type_dummy_UNK, vendor_id_dummy_CMT,
vendor_id_dummy_VTS]
return logit_fit.predict(inputArray)

Get web service credentials


Python

url = predictNYCTAXI.service.url
api_key = predictNYCTAXI.service.api_key

print url
print api_key

@services.service(url, api_key)
@services.types(trip_distance = float, passenger_count = float,
payment_type_dummy_CRD = float,
payment_type_dummy_CSH=float,payment_type_dummy_DIS = float,
payment_type_dummy_NOC = float, payment_type_dummy_UNK = float,
vendor_id_dummy_CMT = float, vendor_id_dummy_VTS = float)
@services.returns(float)
def NYCTAXIPredictor(trip_distance, passenger_count,
payment_type_dummy_CRD, payment_type_dummy_CSH,payment_type_dummy_DIS,
payment_type_dummy_NOC, payment_type_dummy_UNK, vendor_id_dummy_CMT,
vendor_id_dummy_VTS ):
pass

Call Web service API. Typically, wait 5-10 seconds after the previous step.

Python

NYCTAXIPredictor(1,2,1,0,0,0,0,0,1)

Option 2: Create and deploy models directly in Azure Machine Learning


Azure Machine Learning studio (classic) can read data directly from Azure Data Lake
Storage and use that data to create and deploy models. This approach uses a Hive table
that points at the Azure Data Lake Storage. A separate Azure HDInsight cluster needs to
be provisioned for the Hive table.

Create an HDInsight Linux Cluster


Create an HDInsight Cluster (Linux) from the Azure portal . For details, see the Create
an HDInsight cluster with access to Azure Data Lake Storage section in Create an
HDInsight cluster with Data Lake Store using Azure portal.
Create Hive table in HDInsight
Now you create Hive tables to be used in Azure Machine Learning studio (classic) in the
HDInsight cluster using the data stored in Azure Data Lake Storage in the previous step.
Go to the HDInsight cluster that you created. Click Settings --> Properties --> Cluster AAD
Identity --> ADLS Access, and make sure your Azure Data Lake Storage account is added to
the list with read, write, and execute rights.

Then click Dashboard next to the Settings button and a window pops up. Click Hive
View in the upper right corner of the page and you should see the Query Editor.
Paste in the following Hive script to create a table. The location of the data source in
Azure Data Lake Storage is referenced in this way:
adl://data_lake_store_name.azuredatalakestore.net:443/folder_name/file_name.

HiveQL

CREATE EXTERNAL TABLE nyc_stratified_sample


(
medallion string,
hack_license string,
vendor_id string,
rate_code string,
store_and_fwd_flag string,
pickup_datetime string,
dropoff_datetime string,
passenger_count string,
trip_time_in_secs string,
trip_distance string,
pickup_longitude string,
pickup_latitude string,
dropoff_longitude string,
dropoff_latitude string,
payment_type string,
fare_amount string,
surcharge string,
mta_tax string,
tolls_amount string,
total_amount string,
tip_amount string,
tipped string,
tip_class string,
rownum string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\n'
LOCATION
'adl://data_lake_storage_name.azuredatalakestore.net:443/nyctaxi_folder/demo
_ex_9_stratified_1_1000_copy.csv';

When the query completes, you should see the results like this:

Build and deploy models in Azure Machine Learning studio


You're now ready to build and deploy a model that predicts whether or not a tip is paid
with Azure Machine Learning. The stratified sample data is ready to be used in this
binary classification (tip or not) problem. The predictive models using multiclass
classification (tip_class) and regression (tip_amount) can also be built and deployed with
Azure Machine Learning studio, but here only the binary classification model is shown.

1. Get the data into Azure Machine Learning studio (classic) using the Import Data
module, available in the Data Input and Output section. For more information, see
the Import Data module reference page.

2. Select Hive Query as the Data source in the Properties panel.

3. Paste the following Hive script in the Hive database query editor

HiveQL

select * from nyc_stratified_sample;

4. Enter the URL of the HDInsight cluster (this URL can be found in the Azure portal),
then enter the Hadoop credentials, the location of the output data, and the Azure
Storage account name/key/container name.
An example of a binary classification experiment reading data from the Hive table is shown
in the following figure.
After the experiment is created, click Set Up Web Service --> Predictive Web Service.
Run the automatically created scoring experiment, and when it finishes, click Deploy Web
Service.
The web service dashboard is displayed shortly afterward.
Summary
By completing this walkthrough, you've created a data science environment for building
scalable end-to-end solutions in Azure Data Lake. This environment was used to analyze
a large public dataset, taking it through the canonical steps of the Data Science Process,
from data acquisition through model training, and then to the deployment of the model
as a web service. U-SQL was used to process, explore, and sample the data. Python and
Hive were used with Azure Machine Learning studio (classic) to build and deploy
predictive models.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
The Team Data Science Process in action: using Azure Synapse Analytics
Overview of the Data Science Process using Spark on Azure HDInsight

Related resources
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Machine learning at scale
Process data in a SQL Server virtual
machine on Azure
Article • 05/30/2023

This document covers how to explore data and generate features for data stored in a
SQL Server VM on Azure. This goal can be accomplished by data wrangling using SQL or
by using a programming language like Python.

7 Note

The sample SQL statements in this document assume that data is in SQL Server. If it
isn't, refer to the cloud data science process map to learn how to move your data
to SQL Server.

Using SQL
We describe the following data wrangling tasks in this section using SQL:

1. Data Exploration
2. Feature Generation

Data Exploration
Here are a few sample SQL scripts that can be used to explore data stored in SQL Server.

7 Note

For a practical example, you can use the NYC Taxi dataset and refer to the IPNB
titled NYC Data wrangling using IPython Notebook and SQL Server for an end-
to-end walk-through.

1. Get the count of observations per day

SELECT CONVERT(date, <date_columnname>) as date, count(*) as c from


<tablename> group by CONVERT(date, <date_columnname>)

2. Get the levels in a categorical column

select distinct <column_name> from <tablename>


3. Get the number of levels in combination of two categorical columns

select <column_a>, <column_b>,count(*) from <tablename> group by <column_a>,


<column_b>

4. Get the distribution for numerical columns

select <column_name>, count(*) from <tablename> group by <column_name>
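
If you prefer to run the same explorations from Python, each of the queries above maps
naturally onto a pandas operation. The snippet below is only a sketch that uses the same
placeholder server, table, and column names as the SQL examples; substitute your own.

Python

import pandas as pd
import pyodbc

# Placeholder connection and table/column names; replace them with your own.
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=<servername>;DATABASE=<dbname>;UID=<username>;PWD=<password>')
df = pd.read_sql('select * from <tablename>', conn, parse_dates=['<date_columnname>'])

# 1. Count of observations per day
counts_per_day = df.groupby(df['<date_columnname>'].dt.date).size()

# 2. Levels in a categorical column
levels = df['<column_name>'].unique()

# 3. Number of rows for each combination of two categorical columns
combo_counts = df.groupby(['<column_a>', '<column_b>']).size()

# 4. Distribution of values in a column
value_counts = df['<column_name>'].value_counts()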

Feature Generation
In this section, we describe ways of generating features using SQL:

1. Count based Feature Generation


2. Binning Feature Generation
3. Rolling out the features from a single column

7 Note

Once you generate additional features, you can either add them as columns to the
existing table or create a new table with the additional features and primary key,
that can be joined with the original table.

Count based Feature Generation


The following examples demonstrate two ways of generating count features. The first
method uses conditional sum and the second method uses the 'where' clause. These
results may then be joined with the original table (using primary key columns) to have
count features alongside the original data.

SQL

select <column_name1>,<column_name2>,<column_name3>, COUNT(*) as


Count_Features from <tablename> group by <column_name1>,<column_name2>,
<column_name3>

select <column_name1>,<column_name2> , sum(1) as Count_Features from


<tablename>
where <column_name3> = '<some_value>' group by <column_name1>,<column_name2>
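
When the data is already in a pandas data frame, the same count features can be built
with a group-by and a merge. This is a sketch that uses the placeholder column names
from the SQL above and a data frame named df.

Python

# Count features per combination of three columns, joined back onto each row.
count_features = (df.groupby(['<column_name1>', '<column_name2>', '<column_name3>'])
                    .size()
                    .reset_index(name='Count_Features'))
df_with_counts = df.merge(count_features,
                          on=['<column_name1>', '<column_name2>', '<column_name3>'],
                          how='left')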

Binning Feature Generation


The following example shows how to generate binned features by binning (using five
bins) a numerical column that can be used as a feature instead:

SQL

SELECT <column_name>, NTILE(5) OVER (ORDER BY <column_name>) AS BinNumber


from <tablename>
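
In pandas, a comparable five-bin feature can be produced with qcut, which, like NTILE,
assigns rows to quantile-based bins. This sketch assumes the data frame df used in the
earlier snippets.

Python

# Five quantile-based bins, labeled 1-5 to mirror NTILE(5).
# duplicates='drop' merges bins when the column has many repeated values.
df['BinNumber'] = pd.qcut(df['<column_name>'], q=5, labels=False, duplicates='drop') + 1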

Rolling out the features from a single column


In this section, we demonstrate how to roll out a single column in a table to generate
additional features. The example assumes that there is a latitude or longitude column in
the table from which you are trying to generate features.

Here is a brief primer on latitude/longitude location data (resourced from stackoverflow


How to measure the accuracy of latitude and longitude? ). This guidance is useful to
understand before including location as one or more features:

The sign tells us whether we are north or south, east or west on the globe.
A nonzero hundreds digit tells us that we're using longitude, not latitude!
The tens digit gives a position to about 1,000 kilometers. It gives us useful
information about what continent or ocean we are on.
The units digit (one decimal degree) gives a position up to 111 kilometers (60
nautical miles, about 69 miles). It can tell you roughly what state, country, or region
you're in.
The first decimal place is worth up to 11.1 km: it can distinguish the position of one
large city from a neighboring large city.
The second decimal place is worth up to 1.1 km: it can separate one village from
the next.
The third decimal place is worth up to 110 m: it can identify a large agricultural
field or institutional campus.
The fourth decimal place is worth up to 11 m: it can identify a parcel of land. It is
comparable to the typical accuracy of an uncorrected GPS unit with no
interference.
The fifth decimal place is worth up to 1.1 m: it distinguishes trees from each other.
Accuracy to this level with commercial GPS units can only be achieved with
differential correction.
The sixth decimal place is worth up to 0.11 m: you can use this for laying out
structures in detail, for designing landscapes, building roads. It should be more
than good enough for tracking movements of glaciers and rivers. This can be
achieved by taking painstaking measures with GPS, such as differentially corrected
GPS.

The location information can be featurized as follows, separating out region, location,
and city information. You can also call a REST end point such as Bing Maps API available
at Find a Location by Point to get the region/district information.

SQL

select
<location_columnname>
,round(<location_columnname>,0) as l1
,l2=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 1 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),1,1) else '0' end
,l3=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 2 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),2,1) else '0' end
,l4=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 3 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),3,1) else '0' end
,l5=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 4 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),4,1) else '0' end
,l6=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 5 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),5,1) else '0' end
,l7=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 6 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),6,1) else '0' end
from <tablename>
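
The PARSENAME-based expressions above are essentially peeling off individual decimal
digits of the coordinate. If the data is in pandas, the same digit features can be derived
from the fractional part directly; the sketch below assumes a data frame df with a numeric
coordinate column named location (a placeholder for your latitude or longitude column).

Python

# Keep the first six decimal digits of the coordinate as a zero-padded string.
micro = (df['location'].abs() * 10**6).round().astype('int64') % 10**6
digits = micro.astype(str).str.zfill(6)

df['l1'] = df['location'].round(0)
for i in range(6):
    # l2..l7 hold the 1st..6th decimal digit, matching the SQL columns above.
    df['l' + str(i + 2)] = digits.str[i].astype(int)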

These location-based features can be further used to generate additional count features
as described earlier.

 Tip

You can programmatically insert the records using your language of choice. You
may need to insert the data in chunks to improve write efficiency (for an example of
how to do this using pyodbc, see A HelloWorld sample to access SQLServer with
python ). Another alternative is to insert data in the database using the BCP
utility.
Connecting to Azure Machine Learning
The newly generated feature can be added as a column to an existing table or stored in
a new table and joined with the original table for machine learning. Features can be
generated or accessed if already created, using the Import Data module in Azure
Machine Learning as shown below:

Using a programming language like Python


Using Python to explore data and generate features when the data is in SQL Server is
similar to processing data in Azure blob using Python as documented in Process Azure
Blob data in your data science environment. Load the data from the database into a
pandas data frame for more processing. We document the process of connecting to the
database and loading the data into the data frame in this section.

The following connection string format can be used to connect to a SQL Server
database from Python using pyodbc (replace servername, dbname, username, and
password with your specific values):

Python

#Set up the SQL Azure connection


import pyodbc
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=<servername>;DATABASE=
<dbname>;UID=<username>;PWD=<password>')

The Pandas library in Python provides a rich set of data structures and data analysis
tools for data manipulation for Python programming. The code below reads the results
returned from a SQL Server database into a Pandas data frame:

Python

# Query database and load the returned results in pandas data frame
data_frame = pd.read_sql('''select <columnname1>, <columnname2>... from
<tablename>''', conn)

Now you can work with the Pandas data frame as covered in the article Process Azure
Blob data in your data science environment.
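
To write generated features back to SQL Server in chunks, as suggested in the tip earlier,
one option is pyodbc's executemany. The sketch below reuses the conn object from the
snippet above; <feature_table> and its columns are placeholder names.

Python

# Insert the data frame back into SQL Server in chunks of 1,000 rows.
rows = list(data_frame[['<columnname1>', '<columnname2>']].itertuples(index=False, name=None))
cursor = conn.cursor()
chunk_size = 1000
for start in range(0, len(rows), chunk_size):
    cursor.executemany(
        "insert into <feature_table> (<columnname1>, <columnname2>) values (?, ?)",
        rows[start:start + chunk_size])
    conn.commit()
cursor.close()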

Azure Data Science in Action Example


For an end-to-end walkthrough example of the Azure Data Science Process using a
public dataset, see Azure Data Science Process in Action.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.


Cheat sheet for Azure Machine Learning
designer
Article • 11/21/2022

The Azure Machine Learning Algorithm Cheat Sheet helps you choose the right
algorithm from the designer for a predictive analytics model.

Azure Machine Learning has a large library of algorithms from the classification,
recommender systems, clustering, anomaly detection, regression, and text analytics
families. Each is designed to address a different type of machine learning problem.

Download: Machine Learning Algorithm Cheat Sheet


Once you download the cheat sheet, you can print it in tabloid size (11 x 17 in.).

Download the cheat sheet here: Machine Learning Algorithm Cheat Sheet (11x17 in.)

Download and print the Machine Learning Algorithm Cheat Sheet in tabloid size to keep
it handy and get help choosing an algorithm.

More help with Machine Learning


For an overview of Microsoft Azure Machine Learning designer see What is Azure
Machine Learning designer?
For an overview of Microsoft Azure Machine Learning, see What is Azure Machine
Learning?.
For an explanation of how to deploy a scoring web service, see Deploy machine
learning models to Azure.
For a discussion of how to consume a scoring web service, see Consume an Azure
Machine Learning model deployed as a web service.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.


Data science using Spark on Azure
HDInsight
Article • 11/21/2022

This suite of topics shows how to use HDInsight Spark to complete common data
science tasks such as data ingestion, feature engineering, modeling, and model
evaluation. The data used is a sample of the 2013 NYC taxi trip and fare dataset. The
models built include logistic and linear regression, random forests, and gradient
boosted trees. The topics also show how to store these models in Azure blob storage
(WASB) and how to score and evaluate their predictive performance. More advanced
topics cover how models can be trained using cross-validation and hyper-parameter
sweeping. This overview topic also references the topics that describe how to set up the
Spark cluster that you need to complete the steps in the walkthroughs provided.

Spark and MLlib


Spark is an open-source parallel processing framework that supports in-memory
processing to boost the performance of big-data analytic applications. The Spark
processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-
memory distributed computation capabilities make it a good choice for the iterative
algorithms used in machine learning and graph computations. MLlib is Spark's
scalable machine learning library that brings the algorithmic modeling capabilities to
this distributed environment.

HDInsight Spark
HDInsight Spark is the Azure hosted offering of open-source Spark. It also includes
support for Jupyter PySpark notebooks on the Spark cluster that can run Spark SQL
interactive queries for transforming, filtering, and visualizing data stored in Azure Blobs
(WASB). PySpark is the Python API for Spark. The code snippets that provide the
solutions and show the relevant plots to visualize the data here run in Jupyter
notebooks installed on the Spark clusters. The modeling steps in these topics contain
code that shows how to train, evaluate, save, and consume each type of model.

Setup: Spark clusters and Jupyter notebooks


Setup steps and code are provided in this walkthrough for using an HDInsight Spark 1.6 cluster.
But Jupyter notebooks are provided for both HDInsight Spark 1.6 and Spark 2.0 clusters.
A description of the notebooks and links to them are provided in the Readme.md for
the GitHub repository containing them. Moreover, the code here and in the linked
notebooks is generic and should work on any Spark cluster. If you are not using
HDInsight Spark, the cluster setup and management steps may be slightly different from
what is shown here. For convenience, here are the links to the Jupyter notebooks for
Spark 1.6 (to be run in the pySpark kernel of the Jupyter Notebook server) and Spark 2.0
(to be run in the pySpark3 kernel of the Jupyter Notebook server):

Spark 1.6 notebooks


These notebooks are to be run in the pySpark kernel of Jupyter notebook server.

pySpark-machine-learning-data-science-spark-data-exploration-modeling.ipynb :
Provides information on how to perform data exploration, modeling, and scoring
with several different algorithms.
pySpark-machine-learning-data-science-spark-advanced-data-exploration-
modeling.ipynb : Includes topics in notebook #1, and model development using
hyperparameter tuning and cross-validation.
pySpark-machine-learning-data-science-spark-model-consumption.ipynb :
Shows how to operationalize a saved model using Python on HDInsight clusters.

Spark 2.0 notebooks


These notebooks are to be run in the pySpark3 kernel of Jupyter notebook server.

Spark2.0-pySpark3-machine-learning-data-science-spark-advanced-data-
exploration-modeling.ipynb : This file provides information on how to perform
data exploration, modeling, and scoring in Spark 2.0 clusters using the NYC Taxi
trip and fare data-set described here. This notebook may be a good starting point
for quickly exploring the code we have provided for Spark 2.0. For a more detailed
notebook that analyzes the NYC Taxi data, see the next notebook in this list. See the
notes following this list that compare these notebooks.
Spark2.0-pySpark3_NYC_Taxi_Tip_Regression.ipynb : This file shows how to
perform data wrangling (Spark SQL and dataframe operations), exploration,
modeling and scoring using the NYC Taxi trip and fare data-set described here.
Spark2.0-pySpark3_Airline_Departure_Delay_Classification.ipynb : This file shows
how to perform data wrangling (Spark SQL and dataframe operations), exploration,
modeling and scoring using the well-known Airline On-time departure dataset
from 2011 and 2012. We integrated the airline dataset with the airport weather
data (for example, windspeed, temperature, altitude etc.) prior to modeling, so
these weather features can be included in the model.
7 Note

The airline dataset was added to the Spark 2.0 notebooks to better illustrate the
use of classification algorithms. See the following links for information about airline
on-time departure dataset and weather dataset:

Airline on-time departure data: https://www.transtats.bts.gov/ONTIME/

Airport weather data: https://www.ncdc.noaa.gov/

7 Note

The Spark 2.0 notebooks on the NYC taxi and airline flight delay data-sets can take
10 mins or more to run (depending on the size of your HDI cluster). The first
notebook in the above list shows many aspects of the data exploration,
visualization, and ML model training in a notebook that takes less time to run with a
down-sampled NYC data set, in which the taxi and fare files have been pre-joined:
Spark2.0-pySpark3-machine-learning-data-science-spark-advanced-data-
exploration-modeling.ipynb . This notebook takes a much shorter time to finish
(2-3 mins) and may be a good starting point for quickly exploring the code we have
provided for Spark 2.0.

For guidance on the operationalization of a Spark 2.0 model and model consumption
for scoring, see the Spark 1.6 document on consumption for an example outlining the
steps required. To use this example on Spark 2.0, replace the Python code file with this
file .

Prerequisites
The following procedures are related to Spark 1.6. For the Spark 2.0 version, use the
notebooks described and linked to previously.

1. You must have an Azure subscription. If you do not already have one, see Get
Azure free trial .

2. You need a Spark 1.6 cluster to complete this walkthrough. To create one, see the
instructions provided in Get started: create Apache Spark on Azure HDInsight. The
cluster type and version is specified from the Select Cluster Type menu.
7 Note

For a topic that shows how to use Scala rather than Python to complete tasks for an
end-to-end data science process, see the Data Science using Scala with Spark on
Azure.

2 Warning

Billing for HDInsight clusters is prorated per minute, whether you use them or
not. Be sure to delete your cluster after you finish using it. See how to delete
an HDInsight cluster.

The NYC 2013 Taxi data


The NYC Taxi Trip data is about 20 GB of compressed comma-separated values (CSV)
files (~48 GB uncompressed), comprising more than 173 million individual trips and the
fares paid for each trip. Each trip record includes the pickup and dropoff location and
time, anonymized hack (driver's) license number and medallion (taxi's unique id)
number. The data covers all trips in the year 2013 and is provided in the following two
datasets for each month:

1. The 'trip_data' CSV files contain trip details, such as number of passengers, pick up
and dropoff points, trip duration, and trip length. Here are a few sample records:
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude

89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171

0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-06 00:18:35,2013-01-06 00:22:54,1,259,1.50,-74.006683,40.731781,-73.994499,40.75066

0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-05 18:49:41,2013-01-05 18:54:23,1,282,1.10,-74.004707,40.73777,-74.009834,40.726002

DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:54:15,2013-01-07 23:58:20,2,244,.70,-73.974602,40.759945,-73.984734,40.759388

DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:25:03,2013-01-07 23:34:24,1,560,2.10,-73.97625,40.748528,-74.002586,40.747868

2. The 'trip_fare' CSV files contain details of the fare paid for each trip, such as
payment type, fare amount, surcharge and taxes, tips and tolls, and the total
amount paid. Here are a few sample records:

medallion, hack_license, vendor_id, pickup_datetime, payment_type, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount

89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,2013-01-01 15:11:48,CSH,6.5,0,0.5,0,0,7

0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-06 00:18:35,CSH,6,0.5,0.5,0,0,7

0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-05 18:49:41,CSH,5.5,1,0.5,0,0,7

DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07 23:54:15,CSH,5,0.5,0.5,0,0,6

DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07 23:25:03,CSH,9.5,0.5,0.5,0,0,10.5

We have taken a 0.1% sample of these files and joined the trip_data and trip_fare CSV
files into a single dataset to use as the input dataset for this walkthrough. The unique
key to join trip_data and trip_fare is composed of the fields: medallion, hack_license, and
pickup_datetime. Each record of the dataset contains the following attributes
representing a NYC Taxi trip:

Field Brief Description

medallion Anonymized taxi medallion (unique taxi id)

hack_license Anonymized Hackney Carriage License number

vendor_id Taxi vendor id

rate_code NYC taxi rate of fare

store_and_fwd_flag Store and forward flag

pickup_datetime Pick up date & time

dropoff_datetime Dropoff date & time

pickup_hour Pick up hour

pickup_week Pick up week of the year

weekday Weekday (range 1-7)

passenger_count Number of passengers in a taxi trip

trip_time_in_secs Trip time in seconds

trip_distance Trip distance traveled in miles

pickup_longitude Pick up longitude

pickup_latitude Pick up latitude

dropoff_longitude Dropoff longitude

dropoff_latitude Dropoff latitude

direct_distance Direct distance between pickup and dropoff locations

payment_type Payment type (cash, credit-card etc.)

fare_amount Fare amount in


surcharge Surcharge

mta_tax MTA Metro Transportation tax

tip_amount Tip amount

tolls_amount Tolls amount

total_amount Total amount

tipped Tipped (0/1 for no or yes)

tip_class Tip class (0: $0, 1: $0-5, 2: $6-10, 3: $11-20, 4: > $20)

Execute code from a Jupyter notebook on the Spark cluster


You can launch the Jupyter Notebook from the Azure portal. Find your Spark cluster on
your dashboard and click it to enter management page for your cluster. To open the
notebook associated with the Spark cluster, click Cluster Dashboards -> Jupyter
Notebook.
You can also browse to https://CLUSTERNAME.azurehdinsight.net/jupyter to access the
Jupyter Notebooks. Replace the CLUSTERNAME part of this URL with the name of your
own cluster. You need the password for your admin account to access the notebooks.

Select PySpark to see a directory that contains a few examples of pre-packaged


notebooks that use the PySpark API. The notebooks that contain the code samples for
this suite of Spark topics are available on GitHub.
You can upload the notebooks directly from GitHub to the Jupyter notebook server
on your Spark cluster. On your Jupyter home page, click the Upload button on
the right side of the screen to open a file explorer. There you can paste the GitHub (raw
content) URL of the notebook and click Open.

You see the file name on your Jupyter file list with an Upload button again. Click this
Upload button. Now you have imported the notebook. Repeat these steps to upload the
other notebooks from this walkthrough.

 Tip

You can right-click the links on your browser and select Copy Link to get the
GitHub raw content URL. You can paste this URL into the Jupyter Upload file
explorer dialog box.

Now you can:

See the code by clicking the notebook.


Execute each cell by pressing SHIFT-ENTER.
Run the entire notebook by clicking on Cell -> Run.
Use the automatic visualization of queries.

 Tip

The PySpark kernel automatically visualizes the output of SQL (HiveQL) queries. You
are given the option to select among several different types of visualizations (Table,
Pie, Line, Area, or Bar) by using the Type menu buttons in the notebook:
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Machine learning at scale
Data Science using Scala and Spark on
Azure
Article • 11/21/2022

This article shows you how to use Scala for supervised machine learning tasks with the
Spark scalable MLlib and Spark ML packages on an Azure HDInsight Spark cluster. It
walks you through the tasks that constitute the Data Science process: data ingestion and
exploration, visualization, feature engineering, modeling, and model consumption. The
models in the article include logistic and linear regression, random forests, and
gradient-boosted trees (GBTs), in addition to two common supervised machine learning
tasks:

Regression problem: Prediction of the tip amount ($) for a taxi trip
Binary classification: Prediction of tip or no tip (1/0) for a taxi trip

The modeling process requires training and evaluation on a test data set and relevant
accuracy metrics. In this article, you can learn how to store these models in Azure Blob
storage and how to score and evaluate their predictive performance. This article also
covers the more advanced topics of how to optimize models by using cross-validation
and hyper-parameter sweeping. The data used is a sample of the 2013 NYC taxi trip and
fare data set available on GitHub.

Scala , a language based on the Java virtual machine, integrates object-oriented and
functional language concepts. It's a scalable language that is well suited to distributed
processing in the cloud, and runs on Azure Spark clusters.

Spark is an open-source parallel-processing framework that supports in-memory


processing to boost the performance of big data analytics applications. The Spark
processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-
memory distributed computation capabilities make it a good choice for iterative
algorithms in machine learning and graph computations. The spark.ml package
provides a uniform set of high-level APIs built on top of data frames that can help you
create and tune practical machine learning pipelines. MLlib is Spark's scalable
machine learning library, which brings modeling capabilities to this distributed
environment.

HDInsight Spark is the Azure-hosted offering of open-source Spark. It also includes


support for Jupyter Scala notebooks on the Spark cluster, and can run Spark SQL
interactive queries to transform, filter, and visualize data stored in Azure Blob storage.
The Scala code snippets in this article that provide the solutions and show the relevant
plots to visualize the data run in Jupyter notebooks installed on the Spark clusters. The
modeling steps in these topics have code that shows you how to train, evaluate, save,
and consume each type of model.

The setup steps and code in this article are for Azure HDInsight 3.4 Spark 1.6. However,
the code in this article and in the Scala Jupyter Notebook are generic and should work
on any Spark cluster. The cluster setup and management steps might be slightly
different from what is shown in this article if you are not using HDInsight Spark.

7 Note

For a topic that shows you how to use Python rather than Scala to complete tasks
for an end-to-end Data Science process, see Data Science using Spark on Azure
HDInsight.

Prerequisites
You must have an Azure subscription. If you do not already have one, get an Azure
free trial .
You need an Azure HDInsight 3.4 Spark 1.6 cluster to complete the following
procedures. To create a cluster, see the instructions in Get started: Create Apache
Spark on Azure HDInsight. Set the cluster type and version on the Select Cluster
Type menu.

2 Warning
Billing for HDInsight clusters is prorated per minute, whether you use them or
not. Be sure to delete your cluster after you finish using it. See how to delete
an HDInsight cluster.

For a description of the NYC taxi trip data and instructions on how to execute code from
a Jupyter notebook on the Spark cluster, see the relevant sections in Overview of Data
Science using Spark on Azure HDInsight.

Execute Scala code from a Jupyter notebook on the Spark cluster


You can launch a Jupyter notebook from the Azure portal. Find the Spark cluster on your
dashboard, and then click it to enter the management page for your cluster. Next, click
Cluster Dashboards, and then click Jupyter Notebook to open the notebook associated
with the Spark cluster.
You also can access Jupyter notebooks at
https://<clustername>.azurehdinsight.net/jupyter . Replace <clustername> with the
name of your cluster. You need the password for your administrator account to access
the Jupyter notebooks.

Select Scala to see a directory that has a few examples of prepackaged notebooks that
use the Scala API. The Exploration Modeling and Scoring using Scala.ipynb notebook
that contains the code samples for this suite of Spark topics is available on GitHub .
You can upload the notebook directly from GitHub to the Jupyter Notebook server on
your Spark cluster. On your Jupyter home page, click the Upload button. In the file
explorer, paste the GitHub (raw content) URL of the Scala notebook, and then click
Open. The Scala notebook is available at the following URL:

Exploration-Modeling-and-Scoring-using-Scala.ipynb

Setup: Preset Spark and Hive contexts, Spark magics, and Spark libraries



Preset Spark and Hive contexts


Scala

// SET THE START TIME


import java.util.Calendar
val beginningTime = Calendar.getInstance().getTime()

The Spark kernels that are provided with Jupyter notebooks have preset contexts. You
don't need to explicitly set the Spark or Hive contexts before you start working with the
application you are developing. The preset contexts are:

sc for SparkContext
sqlContext for HiveContext

Spark magics
The Spark kernel provides some predefined "magics," which are special commands that
you can call with %% . Two of these commands are used in the following code samples.

%%local specifies that the code in subsequent lines will be executed locally. The
code must be valid Scala code.
%%sql -o <variable name> executes a Hive query against sqlContext . If the -o

parameter is passed, the result of the query is persisted in the %%local Scala
context as a Spark data frame.

For more information about the kernels for Jupyter notebooks and their predefined
"magics" that you call with %% (for example, %%local ), see Kernels available for Jupyter
notebooks with HDInsight Spark Linux clusters on HDInsight.
Import libraries
Import the Spark, MLlib, and other libraries you'll need by using the following code.

Scala

// IMPORT SPARK AND JAVA LIBRARIES


import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import java.text.SimpleDateFormat
import java.util.Calendar
import sqlContext.implicits._
import org.apache.spark.sql.Row

// IMPORT SPARK SQL FUNCTIONS


import org.apache.spark.sql.types.{StructType, StructField, StringType,
IntegerType, FloatType, DoubleType}
import org.apache.spark.sql.functions.rand

// IMPORT SPARK ML FUNCTIONS


import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler,
OneHotEncoder, VectorIndexer, Binarizer}
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit,
CrossValidator}
import org.apache.spark.ml.regression.{LinearRegression,
LinearRegressionModel, RandomForestRegressor, RandomForestRegressionModel,
GBTRegressor, GBTRegressionModel}
import org.apache.spark.ml.classification.{LogisticRegression,
LogisticRegressionModel, RandomForestClassifier,
RandomForestClassificationModel, GBTClassifier, GBTClassificationModel}
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator,
RegressionEvaluator, MulticlassClassificationEvaluator}

// IMPORT SPARK MLLIB FUNCTIONS


import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS,
LogisticRegressionModel}
import org.apache.spark.mllib.regression.{LabeledPoint,
LinearRegressionWithSGD, LinearRegressionModel}
import org.apache.spark.mllib.tree.{GradientBoostedTrees, RandomForest}
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel,
RandomForestModel, Predict}
import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics,
MulticlassMetrics, RegressionMetrics}

// SPECIFY SQLCONTEXT
val sqlContext = new SQLContext(sc)
Data ingestion
The first step in the Data Science process is to ingest the data that you want to analyze.
You bring the data from external sources or systems where it resides into your data
exploration and modeling environment. In this article, the data you ingest is a joined
0.1% sample of the taxi trip and fare file (stored as a .tsv file). The data exploration and
modeling environment is Spark. This section contains the code to complete the
following series of tasks:

1. Set directory paths for data and model storage.


2. Read in the input data set (stored as a .tsv file).
3. Define a schema for the data and clean the data.
4. Create a cleaned data frame and cache it in memory.
5. Register the data as a temporary table in SQLContext.
6. Query the table and import the results into a data frame.

Set directory paths for storage locations in Azure Blob storage


Spark can read and write to Azure Blob storage. You can use Spark to process any of
your existing data, and then store the results again in Blob storage.

To save models or files in Blob storage, you need to properly specify the path. Reference
the default container attached to the Spark cluster by using a path that begins with
wasb:/// . Reference other locations by using wasb:// .
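
For example (the container and storage account names in the second path are placeholders, not values from this walkthrough; the first path is the model directory used later in this article):

Scala

// The default container attached to the cluster is addressed with wasb:///;
// any other reachable container is addressed with the full wasb:// form.
val defaultContainerPath = "wasb:///user/remoteuser/NYCTaxi/Models/"
val otherContainerPath   = "wasb://<containername>@<storageaccount>.blob.core.windows.net/<path>"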

The following code sample specifies the location of the input data to be read and the
path to Blob storage that is attached to the Spark cluster where the model will be saved.

Scala

# SET PATHS TO DATA AND MODEL FILE LOCATIONS


# INGEST DATA AND SPECIFY HEADERS FOR COLUMNS
val taxi_train_file = sc.textFile("wasb://mllibwalkthroughs@cdspsparksamples.blob.core.windows.net/Data/NYCTaxi/JoinedTaxiTripFare.Point1Pct.Train.tsv")
val header = taxi_train_file.first;

# SET THE MODEL STORAGE DIRECTORY PATH


# NOTE THAT THE FINAL FORWARD SLASH IN THE PATH IS REQUIRED.
val modelDir = "wasb:///user/remoteuser/NYCTaxi/Models/";
Import data, create an RDD, and define a data frame
according to the schema
Scala

# RECORD THE START TIME


val starttime = Calendar.getInstance().getTime()

# DEFINE THE SCHEMA BASED ON THE HEADER OF THE FILE


val sqlContext = new SQLContext(sc)
val taxi_schema = StructType(
Array(
StructField("medallion", StringType, true),
StructField("hack_license", StringType, true),
StructField("vendor_id", StringType, true),
StructField("rate_code", DoubleType, true),
StructField("store_and_fwd_flag", StringType, true),
StructField("pickup_datetime", StringType, true),
StructField("dropoff_datetime", StringType, true),
StructField("pickup_hour", DoubleType, true),
StructField("pickup_week", DoubleType, true),
StructField("weekday", DoubleType, true),
StructField("passenger_count", DoubleType, true),
StructField("trip_time_in_secs", DoubleType, true),
StructField("trip_distance", DoubleType, true),
StructField("pickup_longitude", DoubleType, true),
StructField("pickup_latitude", DoubleType, true),
StructField("dropoff_longitude", DoubleType, true),
StructField("dropoff_latitude", DoubleType, true),
StructField("direct_distance", StringType, true),
StructField("payment_type", StringType, true),
StructField("fare_amount", DoubleType, true),
StructField("surcharge", DoubleType, true),
StructField("mta_tax", DoubleType, true),
StructField("tip_amount", DoubleType, true),
StructField("tolls_amount", DoubleType, true),
StructField("total_amount", DoubleType, true),
StructField("tipped", DoubleType, true),
StructField("tip_class", DoubleType, true)
)
)

# CAST VARIABLES ACCORDING TO THE SCHEMA


val taxi_temp = (taxi_train_file.map(_.split("\t"))
.filter((r) => r(0) != "medallion")
.map(p => Row(p(0), p(1), p(2),
p(3).toDouble, p(4), p(5), p(6), p(7).toDouble,
p(8).toDouble, p(9).toDouble, p(10).toDouble,
p(11).toDouble, p(12).toDouble, p(13).toDouble,
p(14).toDouble, p(15).toDouble, p(16).toDouble,
p(17), p(18), p(19).toDouble, p(20).toDouble,
p(21).toDouble, p(22).toDouble,
p(23).toDouble, p(24).toDouble, p(25).toDouble,
p(26).toDouble)))

# CREATE AN INITIAL DATA FRAME AND DROP COLUMNS, AND THEN CREATE A CLEANED
DATA FRAME BY FILTERING FOR UNWANTED VALUES OR OUTLIERS
val taxi_train_df = sqlContext.createDataFrame(taxi_temp, taxi_schema)

val taxi_df_train_cleaned = (taxi_train_df.drop(taxi_train_df.col("medallion"))
    .drop(taxi_train_df.col("hack_license")).drop(taxi_train_df.col("store_and_fwd_flag"))
    .drop(taxi_train_df.col("pickup_datetime")).drop(taxi_train_df.col("dropoff_datetime"))
    .drop(taxi_train_df.col("pickup_longitude")).drop(taxi_train_df.col("pickup_latitude"))
    .drop(taxi_train_df.col("dropoff_longitude")).drop(taxi_train_df.col("dropoff_latitude"))
    .drop(taxi_train_df.col("surcharge")).drop(taxi_train_df.col("mta_tax"))
    .drop(taxi_train_df.col("direct_distance")).drop(taxi_train_df.col("tolls_amount"))
    .drop(taxi_train_df.col("total_amount")).drop(taxi_train_df.col("tip_class"))
    .filter("passenger_count > 0 and passenger_count < 8 AND payment_type in ('CSH', 'CRD') AND tip_amount >= 0 AND tip_amount < 30 AND fare_amount >= 1 AND fare_amount < 150 AND trip_distance > 0 AND trip_distance < 100 AND trip_time_in_secs > 30 AND trip_time_in_secs < 7200"));

# CACHE AND MATERIALIZE THE CLEANED DATA FRAME IN MEMORY


taxi_df_train_cleaned.cache()
taxi_df_train_cleaned.count()

# REGISTER THE DATA FRAME AS A TEMPORARY TABLE IN SQLCONTEXT


taxi_df_train_cleaned.registerTempTable("taxi_train")

# GET THE TIME TO RUN THE CELL


val endtime = Calendar.getInstance().getTime()
val elapsedtime = ((endtime.getTime() -
starttime.getTime())/1000).toString;
println("Time taken to run the above cell: " + elapsedtime + " seconds.");

Output:

Time to run the cell: 8 seconds.

Query the table and import results into a data frame


Next, query the table for fare, passenger, and tip data; filter out corrupt and outlying
data; and print several rows.

Scala

# QUERY THE DATA


val sqlStatement = """
SELECT fare_amount, passenger_count, tip_amount, tipped
FROM taxi_train
WHERE passenger_count > 0 AND passenger_count < 7
AND fare_amount > 0 AND fare_amount < 200
AND payment_type in ('CSH', 'CRD')
AND tip_amount > 0 AND tip_amount < 25
"""
val sqlResultsDF = sqlContext.sql(sqlStatement)

# SHOW ONLY THE TOP THREE ROWS


sqlResultsDF.show(3)

Output:

fare_amount passenger_count tip_amount tipped
13.5        1.0             2.9        1.0
16.0        2.0             3.4        1.0
10.5        2.0             1.0        1.0

Data exploration and visualization


After you bring the data into Spark, the next step in the Data Science process is to gain
a deeper understanding of the data through exploration and visualization. In this
section, you examine the taxi data by using SQL queries. Then, import the results into a
data frame to plot the target variables and prospective features for visual inspection by
using the automatic visualization Jupyter feature.

Use local and SQL magic to plot data


By default, the output of any code snippet that you run from a Jupyter notebook is
available within the context of the session that is persisted on the worker nodes. If you
want to save a trip to the worker nodes for every computation, and if all the data that
you need for your computation is available locally on the Jupyter server node (which is
the head node), you can use the %%local magic to run the code snippet on the Jupyter
server.
SQL magic ( %%sql ). The HDInsight Spark kernel supports easy inline HiveQL
queries against SQLContext. The ( -o VARIABLE_NAME ) argument persists the output
of the SQL query as a Pandas data frame on the Jupyter server. This setting means
the output is available in the local mode.
%%local magic. The %%local magic runs the code locally on the Jupyter server,
which is the head node of the HDInsight cluster. Typically, you use %%local magic
in conjunction with the %%sql magic with the -o parameter. The -o parameter
persists the output of the SQL query locally, and then the %%local magic triggers
the next code snippet to run locally against that locally persisted output.

Query the data by using SQL


This query retrieves the taxi trips by fare amount, passenger count, and tip amount.

Scala

# RUN THE SQL QUERY


%%sql -q -o sqlResults
SELECT fare_amount, passenger_count, tip_amount, tipped FROM taxi_train
WHERE passenger_count > 0 AND passenger_count < 7 AND fare_amount > 0 AND
fare_amount < 200 AND payment_type in ('CSH', 'CRD') AND tip_amount > 0 AND
tip_amount < 25

In the following code, the %%local magic creates a local data frame, sqlResults. You can
use sqlResults to plot by using matplotlib.

 Tip

Local magic is used multiple times in this article. If your data set is large, please
sample to create a data frame that can fit in local memory.
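
For example, here is a minimal sketch of down-sampling the cleaned data before registering a temporary table; the 10% fraction and the seed are arbitrary illustration values, and you would point the %%sql queries at taxi_train_sampled instead of taxi_train.

Scala

// DOWN-SAMPLE THE CLEANED DATA FRAME SO THAT %%sql -o OUTPUT FITS IN LOCAL MEMORY
val taxi_train_sampled = taxi_df_train_cleaned.sample(withReplacement = false, fraction = 0.1, seed = 123)
taxi_train_sampled.registerTempTable("taxi_train_sampled")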

Plot the data


You can plot by using Python code after the data frame is in local context as a Pandas
data frame.

Scala

# RUN THE CODE LOCALLY ON THE JUPYTER SERVER


%%local
# USE THE JUPYTER AUTO-PLOTTING FEATURE TO CREATE INTERACTIVE FIGURES.
# CLICK THE TYPE OF PLOT TO GENERATE (LINE, AREA, BAR, ETC.)
sqlResults

The Spark kernel automatically visualizes the output of SQL (HiveQL) queries after you
run the code. You can choose between several types of visualizations:

Table
Pie
Line
Area
Bar

Here's the code to plot the data:

Scala

# RUN THE CODE LOCALLY ON THE JUPYTER SERVER AND IMPORT LIBRARIES
%%local
import matplotlib.pyplot as plt
%matplotlib inline

# PLOT TIP BY PAYMENT TYPE AND PASSENGER COUNT


ax1 = sqlResults[['tip_amount']].plot(kind='hist', bins=25,
facecolor='lightblue')
ax1.set_title('Tip amount distribution')
ax1.set_xlabel('Tip Amount ($)')
ax1.set_ylabel('Counts')
plt.suptitle('')
plt.show()

# PLOT TIP BY PASSENGER COUNT


ax2 = sqlResults.boxplot(column=['tip_amount'], by=['passenger_count'])
ax2.set_title('Tip amount by Passenger count')
ax2.set_xlabel('Passenger count')
ax2.set_ylabel('Tip Amount ($)')
plt.suptitle('')
plt.show()

# PLOT TIP AMOUNT BY FARE AMOUNT; SCALE POINTS BY PASSENGER COUNT


ax = sqlResults.plot(kind='scatter', x= 'fare_amount', y = 'tip_amount',
c='blue', alpha = 0.10, s=5*(sqlResults.passenger_count))
ax.set_title('Tip amount by Fare amount')
ax.set_xlabel('Fare Amount ($)')
ax.set_ylabel('Tip Amount ($)')
plt.axis([-2, 80, -2, 20])
plt.show()

Output:
Create and transform features, and then prep data for input into modeling functions
For tree-based modeling functions from Spark ML and MLlib, you have to prepare the target
and features by using a variety of techniques, such as binning, indexing, one-hot
encoding, and vectorization. Here are the procedures to follow in this section:

1. Create a new feature by binning hours into traffic time buckets.


2. Apply indexing and one-hot encoding to categorical features.
3. Sample and split the data set into training and test fractions.
4. Specify training variable and features, and then create indexed or one-hot
encoded training and testing input labeled point resilient distributed datasets
(RDDs) or data frames.
5. Automatically categorize and vectorize features and targets to use as inputs for
machine learning models.

Create a new feature by binning hours into traffic time buckets
This code shows you how to create a new feature by binning hours into traffic time
buckets and how to cache the resulting data frame in memory. Where RDDs and data
frames are used repeatedly, caching leads to improved execution times. Accordingly,
you'll cache RDDs and data frames at several stages in the following procedures.

Scala

# CREATE FOUR BUCKETS FOR TRAFFIC TIMES


val sqlStatement = """
SELECT *,
CASE
WHEN (pickup_hour <= 6 OR pickup_hour >= 20) THEN "Night"
WHEN (pickup_hour >= 7 AND pickup_hour <= 10) THEN "AMRush"
WHEN (pickup_hour >= 11 AND pickup_hour <= 15) THEN "Afternoon"
WHEN (pickup_hour >= 16 AND pickup_hour <= 19) THEN "PMRush"
END as TrafficTimeBins
FROM taxi_train
"""
val taxi_df_train_with_newFeatures = sqlContext.sql(sqlStatement)

# CACHE THE DATA FRAME IN MEMORY AND MATERIALIZE THE DATA FRAME IN MEMORY
taxi_df_train_with_newFeatures.cache()
taxi_df_train_with_newFeatures.count()

Indexing and one-hot encoding of categorical features


The modeling and predict functions of MLlib require features with categorical input data
to be indexed or encoded prior to use. This section shows you how to index or encode
categorical features for input into the modeling functions.

You need to index or encode categorical features in different ways, depending on the
model. For example, logistic and linear regression models require one-hot encoding: a
feature with three categories can be expanded into three feature columns, where each
column contains 0 or 1 depending on the category of an observation. MLlib
provides the OneHotEncoder function for one-hot encoding. This encoder maps a
column of label indices to a column of binary vectors with at most a single one-value.
With this encoding, algorithms that expect numerical valued features, such as logistic
regression, can be applied to categorical features.
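
As a minimal, self-contained sketch of this expansion on toy data (not the taxi data set used in this walkthrough; it assumes sqlContext.implicits._ is imported, as it is earlier in this article):

Scala

// Index a three-category string column and one-hot encode it. By default the encoder
// drops the last category (dropLast = true), so three categories become a two-element
// binary vector.
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}

val toyDF = Seq("CSH", "CRD", "NOC", "CSH").toDF("payment_type")
val toyIndexed = new StringIndexer()
    .setInputCol("payment_type").setOutputCol("paymentIndex")
    .fit(toyDF).transform(toyDF)
val toyEncoded = new OneHotEncoder()
    .setInputCol("paymentIndex").setOutputCol("paymentVec")
    .transform(toyIndexed)
toyEncoded.show()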

In this walkthrough, only four variables, which are character strings, are transformed as
examples. You also can index other variables that are represented by numerical values,
such as weekday, as categorical variables.

For indexing, use StringIndexer() , and for one-hot encoding, use OneHotEncoder()
functions from MLlib. Here is the code to index and encode categorical features:

Scala

# CREATE INDEXES AND ONE-HOT ENCODED VECTORS FOR SEVERAL CATEGORICAL FEATURES

# RECORD THE START TIME


val starttime = Calendar.getInstance().getTime()

# INDEX AND ENCODE VENDOR_ID


val stringIndexer = new
StringIndexer().setInputCol("vendor_id").setOutputCol("vendorIndex").fit(tax
i_df_train_with_newFeatures)
val indexed = stringIndexer.transform(taxi_df_train_with_newFeatures)
val encoder = new
OneHotEncoder().setInputCol("vendorIndex").setOutputCol("vendorVec")
val encoded1 = encoder.transform(indexed)

# INDEX AND ENCODE RATE_CODE


val stringIndexer = new
StringIndexer().setInputCol("rate_code").setOutputCol("rateIndex").fit(encod
ed1)
val indexed = stringIndexer.transform(encoded1)
val encoder = new
OneHotEncoder().setInputCol("rateIndex").setOutputCol("rateVec")
val encoded2 = encoder.transform(indexed)

# INDEX AND ENCODE PAYMENT_TYPE


val stringIndexer = new
StringIndexer().setInputCol("payment_type").setOutputCol("paymentIndex").fit
(encoded2)
val indexed = stringIndexer.transform(encoded2)
val encoder = new
OneHotEncoder().setInputCol("paymentIndex").setOutputCol("paymentVec")
val encoded3 = encoder.transform(indexed)

# INDEX AND ENCODE TRAFFIC TIME BINS


val stringIndexer = new
StringIndexer().setInputCol("TrafficTimeBins").setOutputCol("TrafficTimeBins
Index").fit(encoded3)
val indexed = stringIndexer.transform(encoded3)
val encoder = new
OneHotEncoder().setInputCol("TrafficTimeBinsIndex").setOutputCol("TrafficTim
eBinsVec")
val encodedFinal = encoder.transform(indexed)

# GET THE TIME TO RUN THE CELL


val endtime = Calendar.getInstance().getTime()
val elapsedtime = ((endtime.getTime() -
starttime.getTime())/1000).toString;
println("Time taken to run the above cell: " + elapsedtime + " seconds.");

Output:

Time to run the cell: 4 seconds.

Sample and split the data set into training and test
fractions
This code creates a random sampling of the data (25%, in this example). Although
sampling is not required for this example due to the size of the data set, the article
shows you how you can sample so that you know how to use it for your own problems
when needed. When samples are large, this can save significant time while you train
models. Next, split the sample into a training part (75%, in this example) and a testing
part (25%, in this example) to use in classification and regression modeling.

Add a random number (between 0 and 1) to each row (in a "rand" column) that can be
used to select cross-validation folds during training.

Scala

# RECORD THE START TIME


val starttime = Calendar.getInstance().getTime()

# SPECIFY SAMPLING AND SPLITTING FRACTIONS


val samplingFraction = 0.25;
val trainingFraction = 0.75;
val testingFraction = (1-trainingFraction);
val seed = 1234;
val encodedFinalSampledTmp = encodedFinal.sample(withReplacement = false,
fraction = samplingFraction, seed = seed)
val sampledDFcount = encodedFinalSampledTmp.count().toInt
val generateRandomDouble = udf(() => {
scala.util.Random.nextDouble
})

# ADD A RANDOM NUMBER FOR CROSS-VALIDATION


val encodedFinalSampled = encodedFinalSampledTmp.withColumn("rand",
generateRandomDouble());

# SPLIT THE SAMPLED DATA FRAME INTO TRAIN AND TEST, WITH A RANDOM COLUMN
ADDED FOR DOING CROSS-VALIDATION (SHOWN LATER)
# INCLUDE A RANDOM COLUMN FOR CREATING CROSS-VALIDATION FOLDS
val splits = encodedFinalSampled.randomSplit(Array(trainingFraction,
testingFraction), seed = seed)
val trainData = splits(0)
val testData = splits(1)

# GET THE TIME TO RUN THE CELL


val endtime = Calendar.getInstance().getTime()
val elapsedtime = ((endtime.getTime() -
starttime.getTime())/1000).toString;
println("Time taken to run the above cell: " + elapsedtime + " seconds.");

Output:

Time to run the cell: 2 seconds.

Specify training variable and features, and then create indexed or one-hot encoded
training and testing input labeled point RDDs or data frames
This section contains code that shows you how to index categorical text data as a
labeled point data type, and encode it so you can use it to train and test MLlib logistic
regression and other classification models. Labeled point objects are RDDs formatted in
the way that most machine learning algorithms in MLlib require as input. A labeled point
is a local vector, either dense or sparse, associated with a label/response.
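
For example, a minimal sketch with toy values (separate from the walkthrough code that follows):

Scala

// A labeled point pairs a numeric label with a dense or sparse feature vector.
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val densePoint  = LabeledPoint(1.0, Vectors.dense(12.5, 2.0, 600.0))
val sparsePoint = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(12.5, 600.0)))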

In this code, you specify the target (dependent) variable and the features to use to train
models. Then, you create indexed or one-hot encoded training and testing input labeled
point RDDs or data frames.

Scala

# RECORD THE START TIME


val starttime = Calendar.getInstance().getTime()

# MAP NAMES OF FEATURES AND TARGETS FOR CLASSIFICATION AND REGRESSION PROBLEMS
val featuresIndOneHot = List("paymentVec", "vendorVec", "rateVec",
"TrafficTimeBinsVec", "pickup_hour", "weekday", "passenger_count",
"trip_time_in_secs", "trip_distance",
"fare_amount").map(encodedFinalSampled.columns.indexOf(_))
val featuresIndIndex = List("paymentIndex", "vendorIndex", "rateIndex",
"TrafficTimeBinsIndex", "pickup_hour", "weekday", "passenger_count",
"trip_time_in_secs", "trip_distance",
"fare_amount").map(encodedFinalSampled.columns.indexOf(_))

# SPECIFY THE TARGET FOR CLASSIFICATION ('tipped') AND REGRESSION ('tip_amount') PROBLEMS
val targetIndBinary =
List("tipped").map(encodedFinalSampled.columns.indexOf(_))
val targetIndRegression =
List("tip_amount").map(encodedFinalSampled.columns.indexOf(_))

# CREATE INDEXED LABELED POINT RDD OBJECTS


val indexedTRAINbinary = trainData.rdd.map(r =>
LabeledPoint(r.getDouble(targetIndBinary(0).toInt),
Vectors.dense(featuresIndIndex.map(r.getDouble(_)).toArray)))
val indexedTESTbinary = testData.rdd.map(r =>
LabeledPoint(r.getDouble(targetIndBinary(0).toInt),
Vectors.dense(featuresIndIndex.map(r.getDouble(_)).toArray)))
val indexedTRAINreg = trainData.rdd.map(r =>
LabeledPoint(r.getDouble(targetIndRegression(0).toInt),
Vectors.dense(featuresIndIndex.map(r.getDouble(_)).toArray)))
val indexedTESTreg = testData.rdd.map(r =>
LabeledPoint(r.getDouble(targetIndRegression(0).toInt),
Vectors.dense(featuresIndIndex.map(r.getDouble(_)).toArray)))

# CREATE INDEXED DATA FRAMES THAT YOU CAN USE TO TRAIN BY USING SPARK ML
FUNCTIONS
val indexedTRAINbinaryDF = indexedTRAINbinary.toDF()
val indexedTESTbinaryDF = indexedTESTbinary.toDF()
val indexedTRAINregDF = indexedTRAINreg.toDF()
val indexedTESTregDF = indexedTESTreg.toDF()

# CREATE ONE-HOT ENCODED (VECTORIZED) DATA FRAMES THAT YOU CAN USE TO TRAIN
BY USING SPARK ML FUNCTIONS
val assemblerOneHot = new VectorAssembler().setInputCols(Array("paymentVec",
"vendorVec", "rateVec", "TrafficTimeBinsVec", "pickup_hour", "weekday",
"passenger_count", "trip_time_in_secs", "trip_distance",
"fare_amount")).setOutputCol("features")
val OneHotTRAIN = assemblerOneHot.transform(trainData)
val OneHotTEST = assemblerOneHot.transform(testData)

# GET THE TIME TO RUN THE CELL


val endtime = Calendar.getInstance().getTime()
val elapsedtime = ((endtime.getTime() -
starttime.getTime())/1000).toString;
println("Time taken to run the above cell: " + elapsedtime + " seconds.");

Output:
Time to run the cell: 4 seconds.

Automatically categorize and vectorize features and targets to use as inputs for machine learning models
Use Spark ML to categorize the target and features to use in tree-based modeling
functions. The code completes two tasks:

Creates a binary target for classification by assigning a value of 0 or 1 to each data
point whose label lies between 0 and 1, by using a threshold value of 0.5.
Automatically categorizes features. If the number of distinct numerical values for
any feature is less than 32, that feature is categorized.

Here's the code for these two tasks.

Scala

# CATEGORIZE FEATURES AND BINARIZE THE TARGET FOR THE BINARY CLASSIFICATION
PROBLEM

# TRAIN DATA
val indexer = new
VectorIndexer().setInputCol("features").setOutputCol("featuresCat").setMaxCa
tegories(32)
val indexerModel = indexer.fit(indexedTRAINbinaryDF)
val indexedTrainwithCatFeat = indexerModel.transform(indexedTRAINbinaryDF)
val binarizer: Binarizer = new
Binarizer().setInputCol("label").setOutputCol("labelBin").setThreshold(0.5)
val indexedTRAINwithCatFeatBinTarget =
binarizer.transform(indexedTrainwithCatFeat)

# TEST DATA
val indexerModel = indexer.fit(indexedTESTbinaryDF)
val indexedTrainwithCatFeat = indexerModel.transform(indexedTESTbinaryDF)
val binarizer: Binarizer = new
Binarizer().setInputCol("label").setOutputCol("labelBin").setThreshold(0.5)
val indexedTESTwithCatFeatBinTarget =
binarizer.transform(indexedTrainwithCatFeat)

# CATEGORIZE FEATURES FOR THE REGRESSION PROBLEM


# CREATE PROPERLY INDEXED AND CATEGORIZED DATA FRAMES FOR TREE-BASED MODELS

# TRAIN DATA
val indexer = new
VectorIndexer().setInputCol("features").setOutputCol("featuresCat").setMaxCa
tegories(32)
val indexerModel = indexer.fit(indexedTRAINregDF)
val indexedTRAINwithCatFeat = indexerModel.transform(indexedTRAINregDF)

# TEST DATA
val indexerModel = indexer.fit(indexedTESTbinaryDF)
val indexedTESTwithCatFeat = indexerModel.transform(indexedTESTregDF)

Binary classification model: Predict whether a tip should be paid
In this section, you create three types of binary classification models to predict whether
or not a tip should be paid:

A logistic regression model by using the Spark ML LogisticRegression() function
A random forest classification model by using the Spark ML
RandomForestClassifier() function
A gradient boosting tree classification model by using the MLlib
GradientBoostedTrees() function

Create a logistic regression model


Next, create a logistic regression model by using the Spark ML LogisticRegression()
function. You create the model building code in a series of steps:

1. Train the model data with one parameter set.


2. Evaluate the model on a test data set with metrics.
3. Save the model in Blob storage for future consumption.
4. Score the model against test data.
5. Plot the results with receiver operating characteristic (ROC) curves.

Here's the code for these procedures:

Scala

# CREATE A LOGISTIC REGRESSION MODEL


val lr = new
LogisticRegression().setLabelCol("tipped").setFeaturesCol("features").setMax
Iter(10).setRegParam(0.3).setElasticNetParam(0.8)
val lrModel = lr.fit(OneHotTRAIN)

# PREDICT ON THE TEST DATA SET


val predictions = lrModel.transform(OneHotTEST)

# SELECT `BinaryClassificationEvaluator()` TO COMPUTE THE TEST ERROR


val evaluator = new
BinaryClassificationEvaluator().setLabelCol("tipped").setRawPredictionCol("p
robability").setMetricName("areaUnderROC")
val ROC = evaluator.evaluate(predictions)
println("ROC on test data = " + ROC)
# SAVE THE MODEL
val datestamp = Calendar.getInstance().getTime().toString.replaceAll(" ",
".").replaceAll(":", "_");
val modelName = "LogisticRegression__"
val filename = modelDir.concat(modelName).concat(datestamp)
lrModel.save(filename);

Load, score, and save the results.

Scala

# RECORD THE START TIME


val starttime = Calendar.getInstance().getTime()

# LOAD THE SAVED MODEL AND SCORE THE TEST DATA SET
val savedModel =
org.apache.spark.ml.classification.LogisticRegressionModel.load(filename)
println(s"Coefficients: ${savedModel.coefficients} Intercept:
${savedModel.intercept}")

# SCORE THE MODEL ON THE TEST DATA


val predictions =
savedModel.transform(OneHotTEST).select("tipped","probability","rawPredictio
n")
predictions.registerTempTable("testResults")

# SELECT `BinaryClassificationEvaluator()` TO COMPUTE THE TEST ERROR


val evaluator = new
BinaryClassificationEvaluator().setLabelCol("tipped").setRawPredictionCol("p
robability").setMetricName("areaUnderROC")
val ROC = evaluator.evaluate(predictions)

# GET THE TIME TO RUN THE CELL


val endtime = Calendar.getInstance().getTime()
val elapsedtime = ((endtime.getTime() -
starttime.getTime())/1000).toString;
println("Time taken to run the above cell: " + elapsedtime + " seconds.");

# PRINT THE ROC RESULTS


println("ROC on test data = " + ROC)

Output:

ROC on test data = 0.9827381497557599

Use Python on local Pandas data frames to plot the ROC curve.

Scala
# QUERY THE RESULTS
%%sql -q -o sqlResults
SELECT tipped, probability from testResults

# RUN THE CODE LOCALLY ON THE JUPYTER SERVER AND IMPORT LIBRARIES
%%local
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

sqlResults['probFloat'] = sqlResults.apply(lambda row: row['probability'].values()[0][1], axis=1)
predictions_pddf = sqlResults[["tipped","probFloat"]]

# PREDICT THE ROC CURVE


# predictions_pddf = sqlResults.rename(columns={'_1': 'probability',
'tipped': 'label'})
prob = predictions_pddf["probFloat"]
fpr, tpr, thresholds = roc_curve(predictions_pddf['tipped'], prob,
pos_label=1);
roc_auc = auc(fpr, tpr)

# PLOT THE ROC CURVE


plt.figure(figsize=(5,5))
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()

Output:
Create a random forest classification model
Next, create a random forest classification model by using the Spark ML
RandomForestClassifier() function, and then evaluate the model on test data.

Scala

# RECORD THE START TIME


val starttime = Calendar.getInstance().getTime()

# CREATE THE RANDOM FOREST CLASSIFIER MODEL


val rf = new
RandomForestClassifier().setLabelCol("labelBin").setFeaturesCol("featuresCat
").setNumTrees(10).setSeed(1234)

# FIT THE MODEL


val rfModel = rf.fit(indexedTRAINwithCatFeatBinTarget)
val predictions = rfModel.transform(indexedTESTwithCatFeatBinTarget)

# EVALUATE THE MODEL


val evaluator = new
MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("p
rediction").setMetricName("f1")
val Test_f1Score = evaluator.evaluate(predictions)
println("F1 score on test data: " + Test_f1Score);

# GET THE TIME TO RUN THE CELL


val endtime = Calendar.getInstance().getTime()
val elapsedtime = ((endtime.getTime() -
starttime.getTime())/1000).toString;
println("Time taken to run the above cell: " + elapsedtime + " seconds.");

# CALCULATE BINARY CLASSIFICATION EVALUATION METRICS


val evaluator = new
BinaryClassificationEvaluator().setLabelCol("label").setRawPredictionCol("pr
obability").setMetricName("areaUnderROC")
val ROC = evaluator.evaluate(predictions)
println("ROC on test data = " + ROC)

Output:

ROC on test data = 0.9847103571552683

Create a GBT classification model


Next, create a GBT classification model by using MLlib's GradientBoostedTrees()
function, and then evaluate the model on test data.

Scala
# TRAIN A GBT CLASSIFICATION MODEL BY USING MLLIB AND A LABELED POINT

# RECORD THE START TIME


val starttime = Calendar.getInstance().getTime()

# DEFINE THE GBT CLASSIFICATION MODEL


val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 20
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 5
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]((0,2),
(1,2),(2,6),(3,4))

# TRAIN THE MODEL


val gbtModel = GradientBoostedTrees.train(indexedTRAINbinary,
boostingStrategy)

# SAVE THE MODEL IN BLOB STORAGE


val datestamp = Calendar.getInstance().getTime().toString.replaceAll(" ",
".").replaceAll(":", "_");
val modelName = "GBT_Classification__"
val filename = modelDir.concat(modelName).concat(datestamp)
gbtModel.save(sc, filename);

# EVALUATE THE MODEL ON TEST INSTANCES AND COMPUTE THE TEST ERROR
val labelAndPreds = indexedTESTbinary.map { point =>
val prediction = gbtModel.predict(point.features)
(point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble /
indexedTESTbinary.count()
//println("Learned classification GBT model:\n" + gbtModel.toDebugString)
println("Test Error = " + testErr)

# USE BINARY AND MULTICLASS METRICS TO EVALUATE THE MODEL ON THE TEST DATA
val metrics = new MulticlassMetrics(labelAndPreds)
println(s"Precision: ${metrics.precision}")
println(s"Recall: ${metrics.recall}")
println(s"F1 Score: ${metrics.fMeasure}")

val metrics = new BinaryClassificationMetrics(labelAndPreds)


println(s"Area under PR curve: ${metrics.areaUnderPR}")
println(s"Area under ROC curve: ${metrics.areaUnderROC}")

# GET THE TIME TO RUN THE CELL


val endtime = Calendar.getInstance().getTime()
val elapsedtime = ((endtime.getTime() -
starttime.getTime())/1000).toString;
println("Time taken to run the above cell: " + elapsedtime + " seconds.");

# PRINT THE ROC METRIC


println(s"Area under ROC curve: ${metrics.areaUnderROC}")
Output:

Area under ROC curve: 0.9846895479241554

Regression model: Predict tip amount


In this section, you create two types of regression models to predict the tip amount:

A regularized linear regression model by using the Spark ML LinearRegression()
function. You'll save the model and evaluate the model on test data.
A gradient-boosting tree regression model by using the Spark ML GBTRegressor()
function.

Create a regularized linear regression model


Scala

# RECORD THE START TIME


val starttime = Calendar.getInstance().getTime()

# CREATE A REGULARIZED LINEAR REGRESSION MODEL BY USING THE SPARK ML FUNCTION AND DATA FRAMES
val lr = new
LinearRegression().setLabelCol("tip_amount").setFeaturesCol("features").setM
axIter(10).setRegParam(0.3).setElasticNetParam(0.8)

# FIT THE MODEL BY USING DATA FRAMES


val lrModel = lr.fit(OneHotTRAIN)
println(s"Coefficients: ${lrModel.coefficients} Intercept:
${lrModel.intercept}")

# SUMMARIZE THE MODEL OVER THE TRAINING SET AND PRINT METRICS
val trainingSummary = lrModel.summary
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: ${trainingSummary.objectiveHistory.toList}")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")

# SAVE THE MODEL IN AZURE BLOB STORAGE


val datestamp = Calendar.getInstance().getTime().toString.replaceAll(" ",
".").replaceAll(":", "_");
val modelName = "LinearRegression__"
val filename = modelDir.concat(modelName).concat(datestamp)
lrModel.save(filename);

# PRINT THE COEFFICIENTS


println(s"Coefficients: ${lrModel.coefficients} Intercept:
${lrModel.intercept}")
# SCORE THE MODEL ON TEST DATA
val predictions = lrModel.transform(OneHotTEST)

# EVALUATE THE MODEL ON TEST DATA


val evaluator = new
RegressionEvaluator().setLabelCol("tip_amount").setPredictionCol("prediction
").setMetricName("r2")
val r2 = evaluator.evaluate(predictions)
println("R-sqr on test data = " + r2)

# GET THE TIME TO RUN THE CELL


val endtime = Calendar.getInstance().getTime()
val elapsedtime = ((endtime.getTime() -
starttime.getTime())/1000).toString;
println("Time taken to run the above cell: " + elapsedtime + " seconds.");

Output:

Time to run the cell: 13 seconds.

Scala

# LOAD A SAVED LINEAR REGRESSION MODEL FROM BLOB STORAGE AND SCORE A TEST
DATA SET

# RECORD THE START TIME


val starttime = Calendar.getInstance().getTime()

# LOAD A SAVED LINEAR REGRESSION MODEL FROM AZURE BLOB STORAGE


val savedModel =
org.apache.spark.ml.regression.LinearRegressionModel.load(filename)
println(s"Coefficients: ${savedModel.coefficients} Intercept:
${savedModel.intercept}")

# SCORE THE MODEL ON TEST DATA


val predictions =
savedModel.transform(OneHotTEST).select("tip_amount","prediction")
predictions.registerTempTable("testResults")

# EVALUATE THE MODEL ON TEST DATA


val evaluator = new
RegressionEvaluator().setLabelCol("tip_amount").setPredictionCol("prediction
").setMetricName("r2")
val r2 = evaluator.evaluate(predictions)
println("R-sqr on test data = " + r2)

# GET THE TIME TO RUN THE CELL


val endtime = Calendar.getInstance().getTime()
val elapsedtime = ((endtime.getTime() -
starttime.getTime())/1000).toString;
println("Time taken to run the above cell: " + elapsedtime + " seconds.");
# PRINT THE RESULTS
println("R-sqr on test data = " + r2)

Output:

R-sqr on test data = 0.5960320470835743

Next, query the test results as a data frame and use AutoVizWidget and matplotlib to
visualize it.

SQL

# RUN A SQL QUERY


%%sql -q -o sqlResults
select * from testResults

# RUN THE CODE LOCALLY ON THE JUPYTER SERVER


%%local

# USE THE JUPYTER AUTO-PLOTTING FEATURE TO CREATE INTERACTIVE FIGURES


# CLICK THE TYPE OF PLOT TO GENERATE (LINE, AREA, BAR, AND SO ON)
sqlResults

The code creates a local data frame from the query output and plots the data. The
%%local magic creates a local data frame, sqlResults , which you can use to plot with
matplotlib.

Note

This Spark magic is used multiple times in this article. If the amount of data is large,
you should sample to create a data frame that can fit in local memory.

Create plots by using Python matplotlib.

Scala

# RUN THE CODE LOCALLY ON THE JUPYTER SERVER AND IMPORT LIBRARIES
%%local
sqlResults
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# PLOT THE RESULTS


ax = sqlResults.plot(kind='scatter', figsize = (6,6), x='tip_amount',
y='prediction', color='blue', alpha = 0.25, label='Actual vs. predicted');
fit = np.polyfit(sqlResults['tip_amount'], sqlResults['prediction'], deg=1)
ax.set_title('Actual vs. Predicted Tip Amounts ($)')
ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
#ax.plot(sqlResults['tip_amount'], fit[0] * sqlResults['prediction'] +
fit[1], color='magenta')
plt.axis([-1, 15, -1, 8])
plt.show(ax)

Output:

Create a GBT regression model


Create a GBT regression model by using the Spark ML GBTRegressor() function, and
then evaluate the model on test data.

Gradient-boosted trees (GBTs) are ensembles of decision trees. GBTs train decision
trees iteratively to minimize a loss function. You can use GBTs for regression and
classification. They can handle categorical features, do not require feature scaling, and
can capture nonlinearities and feature interactions. You also can use them in a
multiclass-classification setting.
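
As a brief illustration in standard gradient-boosting notation (not taken from this article), each boosting iteration m adds a new tree h_m(x) fit to the errors of the current ensemble: F_m(x) = F_{m-1}(x) + η · h_m(x), where η is the learning rate. The setMaxIter(10) call in the code below corresponds to the number of boosting iterations.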

Scala

# RECORD THE START TIME


val starttime = Calendar.getInstance().getTime()

# TRAIN A GBT REGRESSION MODEL


val gbt = new
GBTRegressor().setLabelCol("label").setFeaturesCol("featuresCat").setMaxIter
(10)
val gbtModel = gbt.fit(indexedTRAINwithCatFeat)

# MAKE PREDICTIONS
val predictions = gbtModel.transform(indexedTESTwithCatFeat)

# COMPUTE TEST SET R2


val evaluator = new
RegressionEvaluator().setLabelCol("label").setPredictionCol("prediction").se
tMetricName("r2")
val Test_R2 = evaluator.evaluate(predictions)

# GET THE TIME TO RUN THE CELL


val endtime = Calendar.getInstance().getTime()
val elapsedtime = ((endtime.getTime() -
starttime.getTime())/1000).toString;
println("Time taken to run the above cell: " + elapsedtime + " seconds.");

# PRINT THE RESULTS


println("Test R-sqr is: " + Test_R2);

Output:

Test R-sqr is: 0.7655383534596654

Advanced modeling utilities for optimization


In this section, you use machine learning utilities that developers frequently use for
model optimization. Specifically, you can optimize machine learning models three
different ways by using parameter sweeping and cross-validation:

Split the data into train and validation sets, optimize the model by using hyper-
parameter sweeping on a training set, and evaluate on a validation set (linear
regression)
Optimize the model by using cross-validation and hyper-parameter sweeping by
using Spark ML's CrossValidator function (binary classification)
Optimize the model by using custom cross-validation and parameter-sweeping
code to use any machine learning function and parameter set (linear regression)

Cross-validation is a technique that assesses how well a model trained on a known set
of data will generalize to predict the features of data sets on which it has not been
trained. The general idea behind this technique is that a model is trained on a data set
of known data, and then the accuracy of its predictions is tested against an independent
data set. A common implementation is to divide a data set into k-folds, and then train
the model in a round-robin fashion on all but one of the folds.
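As a minimal sketch of this idea (illustrative only), MLlib's MLUtils.kFold utility, which the custom parameter sweep later in this article also uses, returns the (training, validation) RDD pairs directly:

Scala

// SPLIT A LABELED POINT RDD INTO K (TRAINING, VALIDATION) FOLD PAIRS
import org.apache.spark.mllib.util.MLUtils

val folds = MLUtils.kFold(indexedTRAINbinary, numFolds = 3, seed = 1234)
folds.foreach { case (trainFold, validationFold) =>
    // Train a model on trainFold and evaluate it on validationFold here.
    println(s"Training rows: ${trainFold.count()}, validation rows: ${validationFold.count()}")
}
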
Hyper-parameter optimization is the problem of choosing a set of hyper-parameters
for a learning algorithm, usually with the goal of optimizing a measure of the
algorithm's performance on an independent data set. A hyper-parameter is a value that
you must specify outside the model training procedure. Assumptions about hyper-
parameter values can affect the flexibility and accuracy of the model. Decision trees have
hyper-parameters, for example, such as the desired depth and number of leaves in the
tree. You must set a misclassification penalty term for a support vector machine (SVM).

A common way to perform hyper-parameter optimization is to use a grid search, also
called a parameter sweep. In a grid search, an exhaustive search is performed through
the values of a specified subset of the hyper-parameter space for a learning algorithm.
Cross-validation can supply a performance metric to sort out the optimal results
produced by the grid search algorithm. If you use cross-validation hyper-parameter
sweeping, you can help limit problems like overfitting a model to training data. This way,
the model retains the capacity to apply to the general set of data from which the
training data was extracted.
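
For example, here is a minimal sketch (hyper-parameter values chosen only for illustration) of how a parameter grid expands into candidate models: two maxDepth values and three numTrees values yield 2 x 3 = 6 combinations, each of which is trained and evaluated during the sweep.

Scala

// BUILD A SMALL ILLUSTRATIVE PARAMETER GRID AND COUNT ITS COMBINATIONS
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.tuning.ParamGridBuilder

val rfToy = new RandomForestClassifier()
val toyGrid = new ParamGridBuilder()
    .addGrid(rfToy.maxDepth, Array(4, 8))        // 2 values
    .addGrid(rfToy.numTrees, Array(5, 10, 25))   // 3 values
    .build()
println(s"Candidate models in the grid: ${toyGrid.size}")   // 6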

Optimize a linear regression model with hyper-parameter sweeping
Next, split data into train and validation sets, use hyper-parameter sweeping on a
training set to optimize the model, and evaluate on a validation set (linear regression).

Scala

# RECORD THE START TIME


val starttime = Calendar.getInstance().getTime()

# RENAME `tip_amount` AS A LABEL


val OneHotTRAINLabeled =
OneHotTRAIN.select("tip_amount","features").withColumnRenamed(existingName="
tip_amount",newName="label")
val OneHotTESTLabeled =
OneHotTEST.select("tip_amount","features").withColumnRenamed(existingName="t
ip_amount",newName="label")
OneHotTRAINLabeled.cache()
OneHotTESTLabeled.cache()

# DEFINE THE ESTIMATOR FUNCTION: THE LinearRegression() FUNCTION


val lr = new
LinearRegression().setLabelCol("label").setFeaturesCol("features").setMaxIte
r(10)

# DEFINE THE PARAMETER GRID


val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01,
0.001)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.1, 0.5,
0.9)).build()
# DEFINE THE PIPELINE WITH A TRAIN/TEST VALIDATION SPLIT (75% IN THE
# TRAINING SET), AND THEN SPECIFY THE ESTIMATOR, EVALUATOR, AND PARAMETER GRID
val trainPct = 0.75
val trainValidationSplit = new
TrainValidationSplit().setEstimator(lr).setEvaluator(new
RegressionEvaluator).setEstimatorParamMaps(paramGrid).setTrainRatio(trainPct
)

# RUN THE TRAIN VALIDATION SPLIT AND CHOOSE THE BEST SET OF PARAMETERS
val model = trainValidationSplit.fit(OneHotTRAINLabeled)

# MAKE PREDICTIONS ON THE TEST DATA BY USING THE MODEL WITH THE COMBINATION
OF PARAMETERS THAT PERFORMS THE BEST
val testResults = model.transform(OneHotTESTLabeled).select("label",
"prediction")

# COMPUTE TEST SET R2


val evaluator = new
RegressionEvaluator().setLabelCol("label").setPredictionCol("prediction").se
tMetricName("r2")
val Test_R2 = evaluator.evaluate(testResults)

# GET THE TIME TO RUN THE CELL


val endtime = Calendar.getInstance().getTime()
val elapsedtime = ((endtime.getTime() -
starttime.getTime())/1000).toString;
println("Time taken to run the above cell: " + elapsedtime + " seconds.");

println("Test R-sqr is: " + Test_R2);

Output:

Test R-sqr is: 0.6226484708501209

Optimize the binary classification model by using cross-validation and hyper-parameter sweeping
This section shows you how to optimize a binary classification model by using cross-
validation and hyper-parameter sweeping. This uses the Spark ML CrossValidator
function.

Scala

# RECORD THE START TIME


val starttime = Calendar.getInstance().getTime()

# CREATE DATA FRAMES WITH PROPERLY LABELED COLUMNS TO USE WITH THE TRAIN AND
TEST SPLIT
val indexedTRAINwithCatFeatBinTargetRF =
indexedTRAINwithCatFeatBinTarget.select("labelBin","featuresCat").withColumn
Renamed(existingName="labelBin",newName="label").withColumnRenamed(existingN
ame="featuresCat",newName="features")
val indexedTESTwithCatFeatBinTargetRF =
indexedTESTwithCatFeatBinTarget.select("labelBin","featuresCat").withColumnR
enamed(existingName="labelBin",newName="label").withColumnRenamed(existingNa
me="featuresCat",newName="features")
indexedTRAINwithCatFeatBinTargetRF.cache()
indexedTESTwithCatFeatBinTargetRF.cache()

# DEFINE THE ESTIMATOR FUNCTION


val rf = new
RandomForestClassifier().setLabelCol("label").setFeaturesCol("features").set
Impurity("gini").setSeed(1234).setFeatureSubsetStrategy("auto").setMaxBins(3
2)

# DEFINE THE PARAMETER GRID


val paramGrid = new ParamGridBuilder().addGrid(rf.maxDepth,
Array(4,8)).addGrid(rf.numTrees,
Array(5,10)).addGrid(rf.minInstancesPerNode, Array(100,300)).build()

# SPECIFY THE NUMBER OF FOLDS


val numFolds = 3

# DEFINE THE CROSS-VALIDATOR WITH THE ESTIMATOR, EVALUATOR, PARAMETER GRID, AND NUMBER OF FOLDS


val CrossValidator = new CrossValidator().setEstimator(rf).setEvaluator(new
BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(
numFolds)

# RUN CROSS-VALIDATION AND CHOOSE THE BEST SET OF PARAMETERS
val model = CrossValidator.fit(indexedTRAINwithCatFeatBinTargetRF)

# MAKE PREDICTIONS ON THE TEST DATA BY USING THE MODEL WITH THE COMBINATION
OF PARAMETERS THAT PERFORMS THE BEST
val testResults =
model.transform(indexedTESTwithCatFeatBinTargetRF).select("label",
"prediction")

# COMPUTE THE TEST F1 SCORE


val evaluator = new
MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("p
rediction").setMetricName("f1")
val Test_f1Score = evaluator.evaluate(testResults)

# GET THE TIME TO RUN THE CELL


val endtime = Calendar.getInstance().getTime()
val elapsedtime = ((endtime.getTime() -
starttime.getTime())/1000).toString;
println("Time taken to run the above cell: " + elapsedtime + " seconds.");

Output:

Time to run the cell: 33 seconds.


Optimize the linear regression model by using custom
cross-validation and parameter-sweeping code
Next, optimize the model by using custom code, and identify the best model
parameters by using the criterion of highest accuracy. Then, create the final model,
evaluate the model on test data, and save the model in Blob storage. Finally, load the
model, score test data, and evaluate accuracy.

Scala

# RECORD THE START TIME


val starttime = Calendar.getInstance().getTime()

# DEFINE THE PARAMETER GRID AND THE NUMBER OF FOLDS


val paramGrid = new ParamGridBuilder().addGrid(rf.maxDepth,
Array(5,10)).addGrid(rf.numTrees, Array(10,25,50)).build()

val nFolds = 3
val numModels = paramGrid.size
val numParamsinGrid = 2

# SPECIFY THE NUMBER OF CATEGORIES FOR CATEGORICAL VARIABLES


val categoricalFeaturesInfo = Map[Int, Int]((0,2),(1,2),(2,6),(3,4))

var maxDepth = -1
var numTrees = -1
var param = ""
var paramval = -1
var validateLB = -1.0
var validateUB = -1.0
val h = 1.0 / nFolds;
val RMSE = Array.fill(numModels)(0.0)

# CREATE K-FOLDS
val splits = MLUtils.kFold(indexedTRAINbinary, numFolds = nFolds, seed=1234)

# LOOP THROUGH K-FOLDS AND THE PARAMETER GRID TO GET AND IDENTIFY THE BEST
PARAMETER SET BY LEVEL OF ACCURACY
for (i <- 0 to (nFolds-1)) {
validateLB = i * h
validateUB = (i + 1) * h
val validationCV = trainData.filter($"rand" >= validateLB && $"rand" <
validateUB)
val trainCV = trainData.filter($"rand" < validateLB || $"rand" >=
validateUB)
val validationLabPt = validationCV.rdd.map(r =>
LabeledPoint(r.getDouble(targetIndRegression(0).toInt),
Vectors.dense(featuresIndIndex.map(r.getDouble(_)).toArray)));
val trainCVLabPt = trainCV.rdd.map(r =>
LabeledPoint(r.getDouble(targetIndRegression(0).toInt),
Vectors.dense(featuresIndIndex.map(r.getDouble(_)).toArray)));
validationLabPt.cache()
trainCVLabPt.cache()

for (nParamSets <- 0 to (numModels-1)) {


for (nParams <- 0 to (numParamsinGrid-1)) {
param =
paramGrid(nParamSets).toSeq(nParams).param.toString.split("__")(1)
paramval =
paramGrid(nParamSets).toSeq(nParams).value.toString.toInt
if (param == "maxDepth") {maxDepth = paramval}
if (param == "numTrees") {numTrees = paramval}
}
        val rfModel = RandomForest.trainRegressor(trainCVLabPt,
            categoricalFeaturesInfo=categoricalFeaturesInfo,
            numTrees=numTrees, maxDepth=maxDepth,
            featureSubsetStrategy="auto", impurity="variance", maxBins=32)
val labelAndPreds = validationLabPt.map { point =>
val prediction =
rfModel.predict(point.features)
( prediction, point.label
)
}
val validMetrics = new RegressionMetrics(labelAndPreds)
val rmse = validMetrics.rootMeanSquaredError
RMSE(nParamSets) += rmse
}
validationLabPt.unpersist();
trainCVLabPt.unpersist();
}
val minRMSEindex = RMSE.indexOf(RMSE.min)

# GET THE BEST PARAMETERS FROM A CROSS-VALIDATION AND PARAMETER SWEEP


var best_maxDepth = -1
var best_numTrees = -1
for (nParams <- 0 to (numParamsinGrid-1)) {
param =
paramGrid(minRMSEindex).toSeq(nParams).param.toString.split("__")(1)
paramval = paramGrid(minRMSEindex).toSeq(nParams).value.toString.toInt
if (param == "maxDepth") {best_maxDepth = paramval}
if (param == "numTrees") {best_numTrees = paramval}
}

# CREATE THE BEST MODEL WITH THE BEST PARAMETERS AND A FULL TRAINING DATA
SET
val best_rfModel = RandomForest.trainRegressor(indexedTRAINreg,
    categoricalFeaturesInfo=categoricalFeaturesInfo,
    numTrees=best_numTrees, maxDepth=best_maxDepth,
    featureSubsetStrategy="auto", impurity="variance", maxBins=32)

# SAVE THE BEST RANDOM FOREST MODEL IN BLOB STORAGE


val datestamp = Calendar.getInstance().getTime().toString.replaceAll(" ",
".").replaceAll(":", "_");
val modelName = "BestCV_RF_Regression__"
val filename = modelDir.concat(modelName).concat(datestamp)
best_rfModel.save(sc, filename);

# PREDICT ON THE TEST SET WITH THE BEST MODEL AND THEN EVALUATE
val labelAndPreds = indexedTESTreg.map { point =>
val prediction =
best_rfModel.predict(point.features)
( prediction, point.label )
}

val test_rmse = new RegressionMetrics(labelAndPreds).rootMeanSquaredError


val test_rsqr = new RegressionMetrics(labelAndPreds).r2

# GET THE TIME TO RUN THE CELL


val endtime = Calendar.getInstance().getTime()
val elapsedtime = ((endtime.getTime() -
starttime.getTime())/1000).toString;
println("Time taken to run the above cell: " + elapsedtime + " seconds.");

# LOAD THE MODEL


val savedRFModel = RandomForestModel.load(sc, filename)

val labelAndPreds = indexedTESTreg.map { point =>
    val prediction = savedRFModel.predict(point.features)
    (prediction, point.label)
}
# TEST THE MODEL
val test_rmse = new RegressionMetrics(labelAndPreds).rootMeanSquaredError
val test_rsqr = new RegressionMetrics(labelAndPreds).r2

Output:

Time to run the cell: 61 seconds.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Machine learning at scale
Feature engineering in machine learning
Article • 01/06/2023

Note

This item is under maintenance. We encourage you to use the Azure Machine
Learning designer .

Important

Support for Machine Learning Studio (classic) will end on 31 August 2024. We
recommend you transition to Azure Machine Learning by that date.

Beginning 1 December 2021, you will not be able to create new Machine Learning
Studio (classic) resources. Through 31 August 2024, you can continue to use the
existing Machine Learning Studio (classic) resources.

See information on moving machine learning projects from ML Studio
(classic) to Azure Machine Learning.
Learn more about Azure Machine Learning

ML Studio (classic) documentation is being retired and may not be updated in the
future.

In this article, you learn about feature engineering and its role in enhancing data in
machine learning. Learn from illustrative examples drawn from Azure Machine Learning
Studio (classic) experiments.

Feature engineering: The process of creating new features from raw data to
increase the predictive power of the learning algorithm. Engineered features
should capture additional information that is not easily apparent in the original
feature set.
Feature selection: The process of selecting the key subset of features to reduce the
dimensionality of the training problem.

Normally feature engineering is applied first to generate additional features, and then
feature selection is done to eliminate irrelevant, redundant, or highly correlated
features.
Feature engineering and selection are part of the modeling stage of the Team Data
Science Process (TDSP). To learn more about the TDSP and the data science lifecycle, see
What is the TDSP?

What is feature engineering?


Training data consists of a matrix composed of rows and columns. Each row in the
matrix is an observation or record. The columns of each row are the features that
describe each record. The features specified in the experimental design should
characterize the patterns in the data.

Although many of the raw data fields can be used directly to train a model, it's often
necessary to create additional (engineered) features for an enhanced training dataset.

Engineered features that enhance training provide information that better differentiates
the patterns in the data. But this process is something of an art. Sound and productive
decisions often require domain expertise.

Example 1: Add temporal features for a regression model
Let's use the experiment Demand forecasting of bikes rentals in Azure Machine
Learning Studio (classic) to demonstrate how to engineer features for a regression task.
The objective of this experiment is to predict the demand for bike rentals within a
specific month/day/hour.

Bike rental dataset


The Bike Rental UCI dataset is based on real data from a bike share company based in
the United States. It represents the number of bike rentals within a specific hour of a day
for the years 2011 and 2012. It contains 17,379 rows and 17 columns.

The raw feature set contains weather conditions (temperature/humidity/wind speed)
and the type of the day (holiday/weekday). The field to predict is the count, which
represents the bike rentals within a specific hour. Count ranges from 1 to 977.

Create a feature engineering experiment


With the goal of constructing effective features in the training data, four regression
models are built using the same algorithm but with four different training datasets. The
four datasets represent the same raw input data, but with an increasing number of
features. These features are grouped into four categories:

1. A = weather + holiday + weekday + weekend features for the predicted day


2. B = number of bikes that were rented in each of the previous 12 hours
3. C = number of bikes that were rented in each of the previous 12 days at the same
hour
4. D = number of bikes that were rented in each of the previous 12 weeks at the
same hour and the same day

Besides feature set A, which already exists in the original raw data, the other three sets
of features are created through the feature engineering process. Feature set B captures
recent demand for the bikes. Feature set C captures the demand for bikes at a particular
hour. Feature set D captures demand for bikes at a particular hour and particular day of
the week. The four training datasets include feature sets A, A+B, A+B+C, and
A+B+C+D, respectively.

Feature engineering using Studio (classic)


In the Studio (classic) experiment, these four training datasets are formed via four
branches from the pre-processed input dataset. Except for the leftmost branch, each of
these branches contains an Execute R Script module, in which the derived features
(feature set B, C, and D) are constructed and appended to the imported dataset.

The following figure demonstrates the R script used to create feature set B in the second
left branch.

Results
A comparison of the performance results of the four models is summarized in the
following table:
The best results are shown by features A+B+C. The error rate decreases when additional
feature sets are included in the training data. It verifies the presumption that feature
sets B and C provide additional relevant information for the regression task. But adding
feature set D does not seem to provide any additional reduction in the error rate.

Example 2: Create features for text mining


Feature engineering is widely applied in tasks related to text mining such as document
classification and sentiment analysis. Since individual pieces of raw text usually serve as
the input data, the feature engineering process is needed to create the features
involving word/phrase frequencies.

Feature hashing
To achieve this task, a technique called feature hashing is applied to efficiently turn
arbitrary text features into indices. Instead of associating each text feature
(words/phrases) with a particular index, this method applies a hash function to the
features and uses their hash values as indices directly.
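
As a minimal illustration of the idea in plain Scala (this is not the internals of the Studio (classic) module), each token is mapped to one of 2^8 = 256 indices by hashing, and occurrences are counted per index:

Scala

// HASH EACH TOKEN TO A FEATURE INDEX AND COUNT OCCURRENCES PER INDEX
val numFeatures = 256                                   // "Hashing bitsize" of 8 -> 2^8 indices
val tokens = Seq("great", "plot", "great", "characters")
val featureCounts = tokens
    .map(t => math.abs(t.hashCode) % numFeatures)       // token -> hashed feature index
    .groupBy(identity)
    .map { case (index, hits) => index -> hits.size }   // occurrence count per index
println(featureCounts)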

In Studio (classic), there is a Feature Hashing module that creates word/phrase features
conveniently. The following figure shows an example of using this module. The input
dataset contains two columns: the book rating ranging from 1 to 5, and the actual
review content. The goal of this module is to retrieve a set of new features that show
the occurrence frequency of the corresponding word(s)/phrase(s) within the particular
book review. To use this module, complete the following steps:

First, select the column that contains the input text ("Col2" in this example).
Second, set the "Hashing bitsize" to 8, which means 2^8=256 features will be
created. The words/phrases in all the text will be hashed to 256 indices. The
parameter "Hashing bitsize" ranges from 1 to 31. The word(s)/phrase(s) are less
likely to be hashed into the same index if you set it to a larger number.
Third, set the parameter "N-grams" to 2. This value gets the occurrence frequency
of unigrams (a feature for every single word) and bigrams (a feature for every pair
of adjacent words) from the input text. The parameter "N-grams" ranges from 0 to
10, which indicates the maximum number of sequential words to be included in a
feature.

The following figure shows what these new features look like.

Conclusion
Engineered and selected features increase the efficiency of the training process, which
attempts to extract the key information contained in the data. They also improve the
power of these models to classify the input data accurately and to predict outcomes of
interest more robustly.

Feature engineering and selection can also combine to make the learning more
computationally tractable. It does so by enhancing and then reducing the number of
features needed to calibrate or train a model. Mathematically, the selected features are a
minimal set of independent variables that explain the patterns in the data and predict
outcomes successfully.

It's not always necessary to perform feature engineering or feature selection. Whether
it's needed depends on the data, the algorithm selected, and the objective of the experiment.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
To create features for data in specific environments, see the following articles:

Create features for data in SQL Server


Create features for data in a Hadoop cluster using Hive queries

Related resources
Feature selection in the Team Data Science Process
Modeling stage of the Team Data Science Process lifecycle
What is the Team Data Science Process?
Create features for data in SQL Server
using SQL and Python
Article • 05/30/2023

This document shows how to generate features for data stored in a SQL Server VM on
Azure that help algorithms learn more efficiently from the data. You can use SQL or a
programming language like Python to accomplish this task. Both approaches are
demonstrated here.

This task is a step in the Team Data Science Process (TDSP).

7 Note

For a practical example, you can consult the NYC Taxi dataset and refer to the
IPNB titled NYC Data wrangling using IPython Notebook and SQL Server for an
end-to-end walk-through.

Prerequisites
This article assumes that you have:

Created an Azure storage account. If you need instructions, see Create an Azure
Storage account
Stored your data in SQL Server. If you have not, see Move data to an Azure SQL
Database for Azure Machine Learning for instructions on how to move the data
there.

Feature generation with SQL


In this section, we describe ways of generating features using SQL:

Count based Feature Generation


Binning Feature Generation
Rolling out the features from a single column

Note

Once you generate additional features, you can either add them as columns to the
existing table or create a new table with the additional features and primary key,
that can be joined with the original table.

Count based feature generation


This section demonstrates two ways of generating count features. The first method uses a conditional sum and the second method uses the WHERE clause. These new features can then be joined with the original table (using primary-key columns) to place the count features alongside the original data.

SQL

select <column_name1>, <column_name2>, <column_name3>, COUNT(*) as Count_Features
from <tablename>
group by <column_name1>, <column_name2>, <column_name3>

select <column_name1>, <column_name2>, sum(1) as Count_Features
from <tablename>
where <column_name3> = '<some_value>'
group by <column_name1>, <column_name2>
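If you prefer to compute the same counts client-side, the following rough pandas equivalent may help; it is only a sketch and assumes the table has already been loaded into a data frame named df with matching column names.

Python

# Sketch: count-based features with pandas (df is assumed to hold the table contents).
import pandas as pd

# Equivalent of the first query: group-by counts over three columns
count_features = (df.groupby(["column_name1", "column_name2", "column_name3"])
                    .size()
                    .reset_index(name="Count_Features"))

# Equivalent of the second query: counts restricted by a filter on column_name3
filtered_counts = (df[df["column_name3"] == "some_value"]
                     .groupby(["column_name1", "column_name2"])
                     .size()
                     .reset_index(name="Count_Features"))

# Join the counts back to the original rows on the grouping (key) columns
df_with_counts = df.merge(count_features,
                          on=["column_name1", "column_name2", "column_name3"],
                          how="left")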

Binning Feature Generation


The following example shows how to generate binned features by binning (using five
bins) a numerical column that can be used as a feature instead:

SQL

SELECT <column_name>, NTILE(5) OVER (ORDER BY <column_name>) AS BinNumber
from <tablename>
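A comparable binned feature can be produced client-side; the sketch below uses pandas and pd.qcut to roughly mimic NTILE(5), assuming a data frame df has already been loaded (the handling of tied values may differ slightly from the SQL version).

Python

# Sketch: five quantile-based bins, analogous to NTILE(5) (df is assumed to be loaded).
import pandas as pd

df["BinNumber"] = pd.qcut(df["column_name"], q=5, labels=False, duplicates="drop") + 1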

Rolling out the features from a single column


In this section, we demonstrate how to roll out a single column in a table to generate
additional features. The example assumes that there is a latitude or longitude column in
the table from which you are trying to generate features.

Here is a brief primer on latitude and longitude location data from Stack Overflow .
Here are some useful things to understand about location data before creating features
from the field:

The sign indicates whether we are north or south, east or west on the globe.
A nonzero hundreds digit indicates that the value is a longitude, not a latitude.
The tens digit gives a position to about 1,000 kilometers. It gives useful
information about what continent or ocean we are on.
The units digit (one decimal degree) gives a position up to 111 kilometers (60
nautical miles, about 69 miles). It indicates, roughly, what large state or
country/region we are in.
The first decimal place is worth up to 11.1 km: it can distinguish the position of one
large city from a neighboring large city.
The second decimal place is worth up to 1.1 km: it can separate one village from
the next.
The third decimal place is worth up to 110 m: it can identify a large agricultural
field or institutional campus.
The fourth decimal place is worth up to 11 m: it can identify a parcel of land. It is
comparable to the typical accuracy of an uncorrected GPS unit with no
interference.
The fifth decimal place is worth up to 1.1 m: it distinguishes trees from each other.
Accuracy to this level with commercial GPS units can only be achieved with
differential correction.
The sixth decimal place is worth up to 0.11 m: you can use this level for laying out
structures in detail, for designing landscapes, building roads. It should be more
than good enough for tracking movements of glaciers and rivers. This goal can be
achieved by taking painstaking measures with GPS, such as differentially corrected
GPS.

The location information can be featurized by separating out region, location, and city information. One can also call a REST endpoint, such as the Bing Maps API (see https://msdn.microsoft.com/library/ff701710.aspx ) to get the region/district information.

SQL

select
<location_columnname>
,round(<location_columnname>,0) as l1
,l2=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 1 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),1,1) else '0' end
,l3=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 2 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),2,1) else '0' end
,l4=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 3 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),3,1) else '0' end
,l5=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 4 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),4,1) else '0' end
,l6=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 5 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),5,1) else '0' end
,l7=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 6 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),6,1) else '0' end
from <tablename>

These location-based features can be further used to generate additional count features
as described earlier.
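For comparison, here is a minimal pandas sketch of the same roll-out, pulling the rounded integer part and the first six decimal digits out of a hypothetical location column; the column names are placeholders.

Python

# Sketch: roll out the decimal digits of a latitude/longitude column (df is assumed to be loaded).
import pandas as pd

loc = df["location_columnname"].abs()
df["l1"] = df["location_columnname"].round(0)   # rounded degrees (signed), like the SQL l1

# First six decimal digits of the fractional part, zero-padded
frac_digits = ((loc - loc.astype(int)) * 10**6).round().astype(int).astype(str).str.zfill(6)
for i in range(6):
    df["l" + str(i + 2)] = frac_digits.str[i]   # l2..l7, one digit per feature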

Tip

You can programmatically insert the records using your language of choice. You may need to insert the data in chunks to improve write efficiency. Here is an example of how to do this using pyodbc . Another alternative is to insert data into the database using the BCP utility .
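As a hedged illustration of the tip above, the snippet below inserts the generated features in chunks with pyodbc's executemany; the table and column names are placeholders, and conn is assumed to be an open pyodbc connection.

Python

# Sketch: chunked inserts with pyodbc (conn is an open connection; features_df holds the new features).
chunk_size = 1000
cursor = conn.cursor()
cursor.fast_executemany = True   # batch the parameterized inserts for better write throughput

insert_sql = "INSERT INTO <feature_table> (<key_column>, <feature_column>) VALUES (?, ?)"
rows = list(features_df[["key_column", "feature_column"]].itertuples(index=False, name=None))

for start in range(0, len(rows), chunk_size):
    cursor.executemany(insert_sql, rows[start:start + chunk_size])
    conn.commit()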

Connecting to Azure Machine Learning


The newly generated features can be added as columns to an existing table or stored in a new table and joined with the original table for machine learning. The resulting table can then be saved as a dataset in Azure Machine Learning and made available for data science.

Using a programming language like Python


Using Python to generate features when the data is in SQL Server is similar to
processing data in Azure blob using Python. For comparison, see Process Azure Blob
data in your data science environment. Load the data from the database into a pandas
data frame to process it further. The process of connecting to the database and loading
the data into the data frame is documented in this section.

The following connection string format can be used to connect to a SQL Server
database from Python using pyodbc (replace servername, dbname, username, and
password with your specific values):
Python

# Set up the SQL Azure connection
import pyodbc
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=<servername>;DATABASE=<dbname>;UID=<username>;PWD=<password>')

The Pandas library in Python provides a rich set of data structures and data analysis
tools for data manipulation for Python programming. The following code reads the
results returned from a SQL Server database into a Pandas data frame:

Python

# Query the database and load the returned results into a pandas data frame
import pandas as pd
data_frame = pd.read_sql('''select <columnname1>, <columnname2>... from <tablename>''', conn)

Now you can work with the Pandas data frame as covered in the topic Create features for Azure blob storage data using Pandas.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.


Create features for data in a Hadoop
cluster using Hive queries
Article • 11/21/2022

This document shows how to create features for data stored in an Azure HDInsight
Hadoop cluster using Hive queries. These Hive queries use embedded Hive User-
Defined Functions (UDFs), the scripts for which are provided.

The operations needed to create features can be memory intensive. The performance of
Hive queries becomes more critical in such cases and can be improved by tuning certain
parameters. The tuning of these parameters is discussed in the final section.

Examples of the queries presented here, which are specific to the NYC Taxi Trip Data scenarios, are also provided in a GitHub repository . These queries already have the data schema specified and are ready to be submitted to run. In the final section, parameters that users can tune to improve the performance of the Hive queries are also discussed.

This task is a step in the Team Data Science Process (TDSP).

Prerequisites
This article assumes that you have:

Created an Azure storage account. If you need instructions, see Create an Azure
Storage account
Provisioned a customized Hadoop cluster with the HDInsight service. If you need
instructions, see Customize Azure HDInsight Hadoop Clusters for Advanced
Analytics.
The data has been uploaded to Hive tables in Azure HDInsight Hadoop clusters. If
it has not, follow Create and load data to Hive tables to upload data to Hive tables
first.
Enabled remote access to the cluster. If you need instructions, see Access the Head
Node of Hadoop Cluster.

Feature generation
In this section, several examples of the ways in which features can be generated using
Hive queries are described. Once you have generated additional features, you can either
add them as columns to the existing table or create a new table with the additional
features and primary key, which can then be joined with the original table. Here are the
examples presented:

1. Frequency-based Feature Generation


2. Risks of Categorical Variables in Binary Classification
3. Extract features from Datetime Field
4. Extract features from Text Field
5. Calculate distance between GPS coordinates

Frequency-based feature generation


It is often useful to calculate the frequencies of the levels of a categorical variable, or the
frequencies of certain combinations of levels from multiple categorical variables. Users
can use the following script to calculate these frequencies:

HiveQL

select
a.<column_name1>, a.<column_name2>, a.sub_count/sum(a.sub_count) over ()
as frequency
from
(
select
<column_name1>,<column_name2>, count(*) as sub_count
from <databasename>.<tablename> group by <column_name1>, <column_name2>
)a
order by frequency desc;

Risks of categorical variables in binary classification


In binary classification, non-numeric categorical variables must be converted into
numeric features when the models being used only take numeric features. This
conversion is done by replacing each non-numeric level with a numeric risk. This section
shows some generic Hive queries that calculate the risk values (log odds) of a
categorical variable.

HiveQL

set smooth_param1=1;
set smooth_param2=20;
select
<column_name1>,<column_name2>,
ln((sum_target+${hiveconf:smooth_param1})/(record_count-
sum_target+${hiveconf:smooth_param2}-${hiveconf:smooth_param1})) as risk
from
(
select
    <column_name1>, <column_name2>, sum(binary_target) as sum_target,
sum(1) as record_count
from
(
select
<column_name1>, <column_name2>, if(target_column>0,1,0) as
binary_target
from <databasename>.<tablename>
)a
group by <column_name1>, <column_name2>
)b

In this example, variables smooth_param1 and smooth_param2 are set to smooth the risk
values calculated from the data. Risks have a range between -Inf and Inf. A risk > 0
indicates that the probability that the target is equal to 1 is greater than 0.5.

After the risk table is calculated, users can assign risk values to a table by joining it with the risk table. The Hive join query was provided in an earlier section.
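The same smoothed log-odds calculation can be expressed in Python; the following is only a sketch and assumes a pandas data frame df with the two categorical columns and a target_column is available.

Python

# Sketch: smoothed log odds (risk) per categorical level (df is assumed to be loaded).
import numpy as np
import pandas as pd

smooth_param1, smooth_param2 = 1, 20

grouped = (df.assign(binary_target=(df["target_column"] > 0).astype(int))
             .groupby(["column_name1", "column_name2"])["binary_target"]
             .agg(sum_target="sum", record_count="count")
             .reset_index())

# Mirrors the Hive expression: ln((sum_target + p1) / (record_count - sum_target + p2 - p1))
grouped["risk"] = np.log(
    (grouped["sum_target"] + smooth_param1)
    / (grouped["record_count"] - grouped["sum_target"] + smooth_param2 - smooth_param1)
)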

Extract features from datetime fields


Hive comes with a set of UDFs for processing datetime fields. In Hive, the default datetime format is 'yyyy-MM-dd HH:mm:ss' (for example, '1970-01-01 12:21:32'). This section shows examples that extract the day of the month and the month from a datetime field, and other examples that convert a datetime string in a format other than the default into a datetime string in the default format.

HiveQL

select day(<datetime field>), month(<datetime field>)
from <databasename>.<tablename>;

This Hive query assumes that the <datetime field> is in the default datetime format.

If a datetime field is not in the default format, you need to convert the datetime field
into Unix time stamp first, and then convert the Unix time stamp to a datetime string
that is in the default format. When the datetime is in default format, users can apply the
embedded datetime UDFs to extract features.

HiveQL

select from_unixtime(unix_timestamp(<datetime field>, '<pattern of the datetime field>'))
from <databasename>.<tablename>;
In this query, if the <datetime field> has a pattern like 03/26/2015 12:04:39, the <pattern of the datetime field> should be 'MM/dd/yyyy HH:mm:ss'. To test it, users can run:

HiveQL

select from_unixtime(unix_timestamp('05/15/2015 09:32:10', 'MM/dd/yyyy HH:mm:ss'))
from hivesampletable limit 1;

The hivesampletable in this query comes preinstalled on all Azure HDInsight Hadoop
clusters by default when the clusters are provisioned.
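An equivalent extraction for data that has been pulled into Python might look like the following sketch; the column name is a placeholder and df is assumed to be a pandas data frame.

Python

# Sketch: datetime feature extraction in pandas (df is assumed to hold the table contents).
import pandas as pd

# Parse a non-default format explicitly, mirroring unix_timestamp(..., 'MM/dd/yyyy HH:mm:ss')
parsed = pd.to_datetime(df["datetime_field"], format="%m/%d/%Y %H:%M:%S")

df["day_of_month"] = parsed.dt.day
df["month"] = parsed.dt.month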

Extract features from text fields


When the Hive table has a text field that contains a string of words that are delimited by
spaces, the following query extracts the length of the string, and the number of words in
the string.

HiveQL

select length(<text field>) as str_len, size(split(<text field>, ' ')) as word_num
from <databasename>.<tablename>;
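The same two features are easy to compute in pandas once the data is in a data frame; this sketch uses placeholder column names.

Python

# Sketch: simple text features in pandas (df is assumed to hold the table contents).
df["str_len"] = df["text_field"].str.len()                  # length of the string
df["word_num"] = df["text_field"].str.split().str.len()     # number of whitespace-delimited words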

Calculate distances between sets of GPS coordinates


The query given in this section can be directly applied to the NYC Taxi Trip Data. The
purpose of this query is to show how to apply an embedded mathematical function in
Hive to generate features.

The fields that are used in this query are the GPS coordinates of pickup and dropoff
locations, named pickup_longitude, pickup_latitude, dropoff_longitude, and
dropoff_latitude. The queries that calculate the direct distance between the pickup and
dropoff coordinates are:

HiveQL

set R=3959;
set pi=radians(180);
select pickup_longitude, pickup_latitude, dropoff_longitude,
dropoff_latitude,
${hiveconf:R}*2*2*atan((1-sqrt(1-pow(sin((dropoff_latitude-
pickup_latitude)
*${hiveconf:pi}/180/2),2)-cos(pickup_latitude*${hiveconf:pi}/180)
*cos(dropoff_latitude*${hiveconf:pi}/180)*pow(sin((dropoff_longitude-
pickup_longitude)*${hiveconf:pi}/180/2),2)))
/sqrt(pow(sin((dropoff_latitude-
pickup_latitude)*${hiveconf:pi}/180/2),2)

+cos(pickup_latitude*${hiveconf:pi}/180)*cos(dropoff_latitude*${hiveconf:pi}
/180)*
pow(sin((dropoff_longitude-pickup_longitude)*${hiveconf:pi}/180/2),2)))
as direct_distance
from nyctaxi.trip
where pickup_longitude between -90 and 0
and pickup_latitude between 30 and 90
and dropoff_longitude between -90 and 0
and dropoff_latitude between 30 and 90
limit 10;

The mathematical equations that calculate the distance between two GPS coordinates can be found on the Movable Type Scripts site, authored by Peter Lapisu. In this JavaScript, the function toRad() is just lat_or_lon * pi / 180, which converts degrees to radians; here, lat_or_lon is the latitude or longitude. Because Hive does not provide the function atan2, but does provide the function atan, the atan2 function is implemented with the atan function in the above Hive query, using the definition provided in Wikipedia .
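For reference, the same great-circle distance can be written directly in Python, where math.atan2 is available; this sketch uses the standard haversine formulation rather than the atan re-expression needed in Hive.

Python

# Sketch: haversine distance in miles between two GPS coordinates.
import math

def haversine_miles(lat1, lon1, lat2, lon2, radius_miles=3959):
    to_rad = math.pi / 180.0                      # plays the role of toRad() in the JavaScript
    dlat = (lat2 - lat1) * to_rad
    dlon = (lon2 - lon1) * to_rad
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1 * to_rad) * math.cos(lat2 * to_rad) * math.sin(dlon / 2) ** 2)
    return radius_miles * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# Example: approximate distance between two points in Manhattan
print(haversine_miles(40.7580, -73.9855, 40.7488, -73.9857))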

A full list of Hive embedded UDFs can be found in the Built-in Functions section on the Apache Hive wiki .

Advanced topics: Tune Hive parameters to improve query speed
The default parameter settings of a Hive cluster might not be suitable for the Hive queries and the data that the queries are processing. This section discusses some parameters that users can tune to improve the performance of Hive queries. Users need to add the parameter-tuning statements before the queries that process the data.

1. Java heap space: For queries involving joining large datasets, or processing long
records, running out of heap space is one of the common errors. This error can be
avoided by setting parameters mapreduce.map.java.opts and
mapreduce.task.io.sort.mb to desired values. Here is an example:

HiveQL
set mapreduce.map.java.opts=-Xmx4096m;
set mapreduce.task.io.sort.mb=1024;

These settings allocate 4 GB of memory to the Java heap space and also make sorting more efficient by allocating more memory for it. It is a good idea to adjust these allocations if there are any job failures related to heap space.

2. DFS block size: This parameter sets the smallest unit of data that the file system
stores. As an example, if the DFS block size is 128 MB, then any data of size less
than and up to 128 MB is stored in a single block. Data that is larger than 128 MB
is allotted extra blocks.

3. Choosing a small block size causes large overheads in Hadoop since the name
node has to process many more requests to find the relevant block pertaining to
the file. A recommended setting when dealing with gigabytes (or larger) data is:

HiveQL

set dfs.block.size=128m;

4. Optimizing join operation in Hive: While join operations in the map/reduce


framework typically take place in the reduce phase, sometimes, enormous gains
can be achieved by scheduling joins in the map phase (also called "mapjoins"). Set
this option:

HiveQL

set hive.auto.convert.join=true;

5. Specifying the number of mappers to Hive: While Hadoop allows the user to set the number of reducers, the number of mappers is typically not set by the user. A trick that allows some degree of control over this number is to choose the Hadoop variables mapred.min.split.size and mapred.max.split.size, because the size of each map task is determined by:

HiveQL

num_maps = max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))

Typically, the default values are:

mapred.min.split.size: 0
mapred.max.split.size: Long.MAX
dfs.block.size: 64 MB

As we can see, given the data size, tuning these parameters by "setting" them
allows us to tune the number of mappers used.

6. Here are a few other more advanced options for optimizing Hive performance.
These options allow you to set the memory allocated to map and reduce tasks, and
can be useful in tweaking performance. Keep in mind that the
mapreduce.reduce.memory.mb cannot be greater than the physical memory size of
each worker node in the Hadoop cluster.

HiveQL

set mapreduce.map.memory.mb = 2048;


set mapreduce.reduce.memory.mb=6144;
set mapreduce.reduce.java.opts=-Xmx8192m;
set mapred.reduce.tasks=128;
set mapred.tasktracker.reduce.tasks.maximum=128;

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.


Feature selection in the Team Data
Science Process (TDSP)
Article • 11/21/2022

This article explains the purposes of feature selection and provides examples of its role
in the data enhancement process of machine learning. These examples are drawn from
Azure Machine Learning Studio.

The engineering and selection of features is one part of the Team Data Science Process
(TDSP) outlined in the article What is the Team Data Science Process?. Feature
engineering and selection are parts of the Develop features step of the TDSP.

feature engineering: This process attempts to create additional relevant features from the existing raw features in the data, and to increase the predictive power of the learning algorithm.
feature selection: This process selects the key subset of original data features in an attempt to reduce the dimensionality of the training problem.

Normally feature engineering is applied first to generate additional features, and then
the feature selection step is performed to eliminate irrelevant, redundant, or highly
correlated features.

Filter features from your data - feature selection
Feature selection may be used for classification or regression tasks. The goal is to select
a subset of the features from the original dataset that reduce its dimensions by using a
minimal set of features to represent the maximum amount of variance in the data. This
subset of features is used to train the model. Feature selection serves two main
purposes.

First, feature selection often increases classification accuracy by eliminating irrelevant, redundant, or highly correlated features.
Second, it decreases the number of features, which makes the model training
process more efficient. Efficiency is important for learners that are expensive to
train such as support vector machines.

Although feature selection does seek to reduce the number of features in the dataset
used to train the model, it is not referred to by the term "dimensionality reduction".
Feature selection methods extract a subset of original features in the data without
changing them. Dimensionality reduction methods employ engineered features that can
transform the original features and thus modify them. Examples of dimensionality
reduction methods include principal component analysis (PCA), canonical correlation
analysis, and singular value decomposition (SVD).

Among others, one widely applied category of feature selection methods in a supervised
context is called "filter-based feature selection". By evaluating the correlation between
each feature and the target attribute, these methods apply a statistical measure to
assign a score to each feature. The features are then ranked by the score, which may be
used to help set the threshold for keeping or eliminating a specific feature. Examples of
statistical measures used in these methods include Pearson correlation coefficient (PCC),
mutual information (MI), and the chi-squared test.
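As an illustration outside the designer, a filter-based selection step can be sketched with scikit-learn (an assumed dependency, not part of the TDSP tooling): features are scored with mutual information and the top k are kept. The synthetic dataset below is only for demonstration.

Python

# Sketch: filter-based feature selection with scikit-learn (assumed to be installed).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data standing in for X (features) and y (target attribute)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)                      # per-feature scores used for ranking
print(selector.get_support(indices=True))    # indices of the retained features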

Azure Machine Learning Designer


One tool inside Azure Machine Learning is the designer. Azure Machine Learning
designer is a drag-and-drop interface used to train and deploy models in Azure
Machine Learning. To manage features, there are different tools available inside
designer.

The Filter Based Feature Selection component in Azure Machine Learning designer helps
you identify the columns in your input dataset that have the greatest predictive power.

The Permutation Feature Importance component in Azure Machine Learning designer computes a set of feature importance scores for your dataset; you then use these scores to help you determine the best features to use in a model.

Conclusion
Feature engineering and feature selection are two commonly applied engineering techniques to
increase training efficiency. These techniques also improve the model's power to classify
the input data accurately and to predict outcomes of interest more robustly. Feature
engineering and selection can also combine to make the learning more computationally
efficient by enhancing and then reducing the number of features needed to calibrate or
train a model. Mathematically speaking, the features selected to train the model are a
minimal set of independent variables that explain the maximum variance in the data to
predict the outcome feature.

It is not always necessary to perform feature engineering or feature selection. Whether it is needed depends on the data collected, the algorithm selected, and the objective of the experiment.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.


Deploy ML models to production to
play an active role in making business
decisions
Article • 11/21/2022

Production deployment enables a model to play an active role in a business. Predictions from a deployed model can be used for business decisions.

Production platforms
There are various approaches and platforms to put models into production. Here are a
few options:

Where to deploy models with Azure Machine Learning
Deployment of a model in SQL Server

Note

Prior to deployment, one has to ensure the latency of model scoring is low enough
to use in production.

Note

For deployment from Azure Machine Learning, see Deploy machine learning
models to Azure.

A/B testing
When multiple models are in production, A/B testing may be used to compare model
performance.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:
Mark Tabladillo | Senior Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Machine learning at scale
Build and deploy a model using Azure
Synapse Analytics
Article • 05/30/2023

In this tutorial, we walk you through building and deploying a machine learning model using Azure Synapse Analytics for a publicly available dataset -- the NYC Taxi Trips dataset. The binary classification model constructed predicts whether or not a tip is paid for a trip. Additional models covered include multiclass classification (predicting the range into which the tip amount falls) and regression (predicting the tip amount paid).

The procedure follows the Team Data Science Process (TDSP) workflow. We show how to set
up a data science environment, how to load the data into Azure Synapse Analytics, and how
to use either Azure Synapse Analytics or a Jupyter Notebook to explore the data and
engineer features to model. We then show how to build and deploy a model with Azure
Machine Learning.

The NYC Taxi Trips dataset


The NYC Taxi Trip data consists of about 20 GB of compressed CSV files (~48 GB
uncompressed), recording more than 173 million individual trips and the fares paid for each
trip. Each trip record includes the pickup and dropoff locations and times, anonymized hack
(driver's) license number, and the medallion (taxi's unique ID) number. The data covers all
trips in the year 2013 and is provided in the following two datasets for each month:

1. The trip_data.csv file contains trip details, such as number of passengers, pickup and
dropoff points, trip duration, and trip length. Here are a few sample records:

medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-06 00:18:35,2013-01-06 00:22:54,1,259,1.50,-74.006683,40.731781,-73.994499,40.75066
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-05 18:49:41,2013-01-05 18:54:23,1,282,1.10,-74.004707,40.73777,-74.009834,40.726002
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:54:15,2013-01-07 23:58:20,2,244,.70,-73.974602,40.759945,-73.984734,40.759388
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:25:03,2013-01-07 23:34:24,1,560,2.10,-73.97625,40.748528,-74.002586,40.747868

2. The trip_fare.csv file contains details of the fare paid for each trip, such as payment
type, fare amount, surcharge and taxes, tips and tolls, and the total amount paid. Here
are a few sample records:

medallion, hack_license, vendor_id, pickup_datetime, payment_type, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount
89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,2013-01-01 15:11:48,CSH,6.5,0,0.5,0,0,7
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-06 00:18:35,CSH,6,0.5,0.5,0,0,7
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-05 18:49:41,CSH,5.5,1,0.5,0,0,7
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07 23:54:15,CSH,5,0.5,0.5,0,0,6
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07 23:25:03,CSH,9.5,0.5,0.5,0,0,10.5

The unique key used to join trip_data and trip_fare is composed of the following three
fields:

medallion,
hack_license and
pickup_datetime.

Address three types of prediction tasks


We formulate three prediction problems based on the tip_amount to illustrate three kinds
of modeling tasks:

1. Binary classification: To predict whether or not a tip was paid for a trip, that is, a
tip_amount that is greater than $0 is a positive example, while a tip_amount of $0 is a
negative example.
2. Multiclass classification: To predict the range of tip paid for the trip. We divide the
tip_amount into five bins or classes:

Class 0: tip_amount = $0
Class 1: tip_amount > $0 and tip_amount <= $5
Class 2: tip_amount > $5 and tip_amount <= $10
Class 3: tip_amount > $10 and tip_amount <= $20
Class 4: tip_amount > $20

3. Regression task: To predict the amount of tip paid for a trip.
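When the joined data is eventually loaded into Python, the labels for these three tasks can be derived from tip_amount along the following lines; this is a sketch that assumes a pandas data frame df holding the joined trip and fare data.

Python

# Sketch: derive labels for the three prediction tasks (df is assumed to hold the joined data).
import pandas as pd

# 1. Binary classification label
df["tipped"] = (df["tip_amount"] > 0).astype(int)

# 2. Multiclass classification label: bins (-inf, 0], (0, 5], (5, 10], (10, 20], (20, inf)
df["tip_class"] = pd.cut(df["tip_amount"],
                         bins=[-float("inf"), 0, 5, 10, 20, float("inf")],
                         labels=[0, 1, 2, 3, 4]).astype(int)

# 3. Regression target is the tip amount itself
y_regression = df["tip_amount"]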

Set up the Azure data science environment for


advanced analytics
To set up your Azure Data Science environment, follow these steps.

Create your own Azure blob storage account

When you provision your own Azure blob storage, choose a geo-location for your
Azure blob storage in or as close as possible to South Central US, which is where the
NYC Taxi data is stored. The data will be copied using AzCopy from the public blob
storage container to a container in your own storage account. The closer your Azure
blob storage is to South Central US, the faster this task (Step 4) will be completed.

To create your own Azure Storage account, follow the steps outlined in About Azure Storage accounts. Be sure to make a note of the values for the following storage account credentials, as they will be needed later in this walkthrough.

Storage Account Name
Storage Account Key
Container Name (the container in which you want the data to be stored in Azure blob storage)

Provision your Azure Synapse Analytics instance. Follow the documentation at Create and
query an Azure Synapse Analytics in the Azure portal to provision an Azure Synapse
Analytics instance. Make sure that you make a note of the following Azure Synapse Analytics credentials, which will be used in later steps.

Server Name: <server Name>.database.windows.net


SQLDW (Database) Name
Username
Password

Install Visual Studio and SQL Server Data Tools. For instructions, see Getting started with
Visual Studio 2019 for Azure Synapse Analytics.

Connect to your Azure Synapse Analytics with Visual Studio. For instructions, see steps 1
& 2 in Connect to SQL Analytics in Azure Synapse Analytics.
Note

Run the following SQL query on the database you created in your Azure Synapse Analytics (instead of the query provided in step 3 of the connect topic) to create a master key.

SQL

BEGIN TRY
--Try to create the master key
CREATE MASTER KEY
END TRY
BEGIN CATCH
--If the master key exists, do nothing
END CATCH;

Create an Azure Machine Learning workspace under your Azure subscription. For
instructions, see Create an Azure Machine Learning workspace.

Load the data into Azure Synapse Analytics


Open a Windows PowerShell command console. Run the following PowerShell commands
to download the example SQL script files that we share with you on GitHub to a local
directory that you specify with the parameter -DestDir. You can change the value of
parameter -DestDir to any local directory. If -DestDir does not exist, it will be created by the
PowerShell script.

Note

You might need to Run as Administrator when executing the following PowerShell
script if your DestDir directory needs Administrator privilege to create or to write to it.

Azure PowerShell

$source = "https://raw.githubusercontent.com/Azure/Azure-MachineLearning-
DataScience/master/Misc/SQLDW/Download_Scripts_SQLDW_Walkthrough.ps1"
$ps1_dest = "$pwd\Download_Scripts_SQLDW_Walkthrough.ps1"
$wc = New-Object System.Net.WebClient
$wc.DownloadFile($source, $ps1_dest)
.\Download_Scripts_SQLDW_Walkthrough.ps1 -DestDir 'C:\tempSQLDW'

After successful execution, your current working directory changes to -DestDir. You should see a screen like the one below.

In your -DestDir, execute the following PowerShell script in administrator mode:

Azure PowerShell

./SQLDW_Data_Import.ps1

When the PowerShell script runs for the first time, you are asked to input the connection information for your Azure Synapse Analytics instance and your Azure blob storage account. When this first run completes, the credentials you entered are written to a configuration file, SQLDW.conf, in the present working directory. Future runs of this PowerShell script can read all needed parameters from this configuration file. If you need to change some parameters, you can either delete the configuration file and enter the parameter values when prompted, or edit the parameter values in the SQLDW.conf file in your -DestDir directory.

Note

To avoid schema name conflicts with schemas that already exist in your Azure Synapse Analytics, when reading parameters directly from the SQLDW.conf file, a 3-digit random number is added to the schema name from the SQLDW.conf file as the default schema name for each run. The PowerShell script may also prompt you for a schema name; the name can be specified at your discretion.

This PowerShell script file completes the following tasks:

Downloads and installs AzCopy, if AzCopy is not already installed

Azure PowerShell

$AzCopy_path = SearchAzCopy
if ($AzCopy_path -eq $null){
Write-Host "AzCopy.exe is not found in C:\Program Files*. Now,
start installing AzCopy..." -ForegroundColor "Yellow"
InstallAzCopy
$AzCopy_path = SearchAzCopy
}
$env_path = $env:Path
for ($i=0; $i -lt $AzCopy_path.count; $i++){
if ($AzCopy_path.count -eq 1){
$AzCopy_path_i = $AzCopy_path
} else {
$AzCopy_path_i = $AzCopy_path[$i]
}
if ($env_path -notlike '*' +$AzCopy_path_i+'*'){
Write-Host $AzCopy_path_i 'not in system path, add it...'
[Environment]::SetEnvironmentVariable("Path",
"$AzCopy_path_i;$env_path", "Machine")
$env:Path =
[System.Environment]::GetEnvironmentVariable("Path","Machine")
$env_path = $env:Path
    }
}

Copies data to your private blob storage account from the public blob with AzCopy

Azure PowerShell

Write-Host "AzCopy is copying data from public blob to yo storage account.


It may take a while..." -ForegroundColor "Yellow"
$start_time = Get-Date
AzCopy.exe /Source:$Source /Dest:$DestURL /DestKey:$StorageAccountKey /S
$end_time = Get-Date
$time_span = $end_time - $start_time
$total_seconds = [math]::Round($time_span.TotalSeconds,2)
Write-Host "AzCopy finished copying data. Please check your storage
account to verify." -ForegroundColor "Yellow"
Write-Host "This step (copying data from public blob to your storage
account) takes $total_seconds seconds." -ForegroundColor "Green"

Loads data using Polybase (by executing LoadDataToSQLDW.sql) to your Azure


Synapse Analytics from your private blob storage account with the following
commands.

Create a schema

SQL

EXEC (''CREATE SCHEMA {schemaname};'');

Create a database scoped credential

SQL

CREATE DATABASE SCOPED CREDENTIAL {KeyAlias}


WITH IDENTITY = ''asbkey'' ,
Secret = ''{StorageAccountKey}''

Create an external data source for an Azure Storage blob


SQL

CREATE EXTERNAL DATA SOURCE {nyctaxi_trip_storage}


WITH
(
TYPE = HADOOP,
LOCATION
=''wasbs://{ContainerName}@{StorageAccountName}.blob.core.windows.net''
,
CREDENTIAL = {KeyAlias}
)
;

CREATE EXTERNAL DATA SOURCE {nyctaxi_fare_storage}


WITH
(
TYPE = HADOOP,
LOCATION
=''wasbs://{ContainerName}@{StorageAccountName}.blob.core.windows.net''
,
CREDENTIAL = {KeyAlias}
)
;

Create an external file format for a csv file. Data is uncompressed and fields are
separated with the pipe character.

SQL

CREATE EXTERNAL FILE FORMAT {csv_file_format}


WITH
(
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS
(
FIELD_TERMINATOR ='','',
USE_TYPE_DEFAULT = TRUE
)
)
;

Create external fare and trip tables for NYC taxi dataset in Azure blob storage.

SQL

CREATE EXTERNAL TABLE {external_nyctaxi_fare}


(
medallion varchar(50) not null,
hack_license varchar(50) not null,
vendor_id char(3),
pickup_datetime datetime not null,
payment_type char(3),
fare_amount float,
surcharge float,
mta_tax float,
tip_amount float,
tolls_amount float,
total_amount float
)
with (
LOCATION = ''/nyctaxifare/'',
DATA_SOURCE = {nyctaxi_fare_storage},
FILE_FORMAT = {csv_file_format},
REJECT_TYPE = VALUE,
REJECT_VALUE = 12
)

CREATE EXTERNAL TABLE {external_nyctaxi_trip}


(
medallion varchar(50) not null,
hack_license varchar(50) not null,
vendor_id char(3),
rate_code char(3),
store_and_fwd_flag char(3),
pickup_datetime datetime not null,
dropoff_datetime datetime,
passenger_count int,
trip_time_in_secs bigint,
trip_distance float,
pickup_longitude varchar(30),
pickup_latitude varchar(30),
dropoff_longitude varchar(30),
dropoff_latitude varchar(30)
)
with (
LOCATION = ''/nyctaxitrip/'',
DATA_SOURCE = {nyctaxi_trip_storage},
FILE_FORMAT = {csv_file_format},
REJECT_TYPE = VALUE,
REJECT_VALUE = 12
)

Load data from external tables in Azure blob storage to Azure Synapse Analytics

SQL

CREATE TABLE {schemaname}.{nyctaxi_fare}


WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH(medallion)
)
AS
SELECT *
FROM {external_nyctaxi_fare}
;

CREATE TABLE {schemaname}.{nyctaxi_trip}


WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH(medallion)
)
AS
SELECT *
FROM {external_nyctaxi_trip}
;

Create a sample data table (NYCTaxi_Sample) and insert data to it from selecting
SQL queries on the trip and fare tables. (Some steps of this walkthrough need to
use this sample table.)

SQL

CREATE TABLE {schemaname}.{nyctaxi_sample}


WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH(medallion)
)
AS
(
SELECT t.*, f.payment_type, f.fare_amount, f.surcharge,
f.mta_tax, f.tolls_amount, f.total_amount, f.tip_amount,
tipped = CASE WHEN (tip_amount > 0) THEN 1 ELSE 0 END,
tip_class = CASE
WHEN (tip_amount = 0) THEN 0
WHEN (tip_amount > 0 AND tip_amount <= 5) THEN 1
WHEN (tip_amount > 5 AND tip_amount <= 10) THEN 2
WHEN (tip_amount > 10 AND tip_amount <= 20) THEN 3
ELSE 4
END
FROM {schemaname}.{nyctaxi_trip} t, {schemaname}.{nyctaxi_fare} f
WHERE datepart("mi",t.pickup_datetime) = 1
AND t.medallion = f.medallion
AND t.hack_license = f.hack_license
AND t.pickup_datetime = f.pickup_datetime
AND pickup_longitude <> ''0''
AND dropoff_longitude <> ''0''
)
;

The geographic location of your storage accounts affects load times.

Note

Depending on the geographical location of your private blob storage account, the process of copying data from a public blob to your private storage account can take about 15 minutes, or even longer, and the process of loading data from your storage account to your Azure Synapse Analytics could take 20 minutes or longer.

You will have to decide what to do if you have duplicate source and destination files.

Note

If the .csv files to be copied from the public blob storage to your private blob storage
account already exist in your private blob storage account, AzCopy will ask you
whether you want to overwrite them. If you do not want to overwrite them, input n
when prompted. If you want to overwrite all of them, input a when prompted. You can
also input y to overwrite .csv files individually.

You can use your own data. If your data is on an on-premises machine in your real-life application, you can still use AzCopy to upload the on-premises data to your private Azure blob storage. You only need to change the source location, $Source = "http://getgoing.blob.core.windows.net/public/nyctaxidataset" , in the AzCopy command of the PowerShell script file to the local directory that contains your data.

Tip

If your data is already in your private Azure blob storage in your real-life application, you can skip the AzCopy step in the PowerShell script and directly upload the data to Azure Synapse Analytics. This will require additional edits of the script to tailor it to the format of your data.

This PowerShell script also plugs in the Azure Synapse Analytics information into the data
exploration example files SQLDW_Explorations.sql, SQLDW_Explorations.ipynb, and
SQLDW_Explorations_Scripts.py so that these three files are ready to be tried out instantly
after the PowerShell script completes.

After a successful execution, you will see a screen like the one below.

Data exploration and feature engineering in


Azure Synapse Analytics
In this section, we perform data exploration and feature generation by running SQL queries
against Azure Synapse Analytics directly using Visual Studio Data Tools. All SQL queries
used in this section can be found in the sample script named SQLDW_Explorations.sql. This
file has already been downloaded to your local directory by the PowerShell script. You can
also retrieve it from GitHub . But the file in GitHub does not have the Azure Synapse
Analytics information plugged in.

Connect to your Azure Synapse Analytics using Visual Studio with the Azure Synapse
Analytics login name and password and open up the SQL Object Explorer to confirm the
database and tables have been imported. Retrieve the SQLDW_Explorations.sql file.

Note

To open a Parallel Data Warehouse (PDW) query editor, use the New Query command
while your PDW is selected in the SQL Object Explorer. The standard SQL query editor
is not supported by PDW.

Here are the types of data exploration and feature generation tasks performed in this
section:

Explore data distributions of a few fields in varying time windows.


Investigate data quality of the longitude and latitude fields.
Generate binary and multiclass classification labels based on the tip_amount.
Generate features and compute/compare trip distances.
Join the two tables and extract a random sample that will be used to build models.

Data import verification


These queries provide a quick verification of the number of rows and columns in the tables populated earlier using Polybase's parallel bulk import.

SQL

-- Report number of rows in table <nyctaxi_trip> without table scan
SELECT SUM(rows) FROM sys.partitions WHERE object_id = OBJECT_ID('<schemaname>.<nyctaxi_trip>')

SQL

-- Report number of columns in table <nyctaxi_trip>
SELECT COUNT(*) FROM information_schema.columns WHERE table_name = '<nyctaxi_trip>' AND table_schema = '<schemaname>'

Output: You should get 173,179,759 rows and 14 columns.

Exploration: Trip distribution by medallion


This example query identifies the medallions (taxi numbers) that completed more than 100
trips within a specified time period. The query would benefit from the partitioned table
access since it is conditioned by the partition scheme of pickup_datetime. Querying the full
dataset will also make use of the partitioned table and/or index scan.

SQL

SELECT medallion, COUNT(*)
FROM <schemaname>.<nyctaxi_fare>
WHERE pickup_datetime BETWEEN '20130101' AND '20130331'
GROUP BY medallion
HAVING COUNT(*) > 100

Output: The query should return a table with rows specifying the 13,369 medallions (taxis)
and the number of trips completed in 2013. The last column contains the count of the
number of trips completed.

Exploration: Trip distribution by medallion and hack_license


This example identifies the medallions (taxi numbers) and hack_license numbers (drivers)
that completed more than 100 trips within a specified time period.

SQL
SELECT medallion, hack_license, COUNT(*)
FROM <schemaname>.<nyctaxi_fare>
WHERE pickup_datetime BETWEEN '20130101' AND '20130131'
GROUP BY medallion, hack_license
HAVING COUNT(*) > 100

Output: The query should return a table with 13,369 rows specifying the 13,369 car/driver IDs that have completed more than 100 trips in 2013. The last column contains the count of the number of trips completed.

Data quality assessment: Verify records with incorrect


longitude and/or latitude
This example investigates whether any of the longitude and/or latitude fields either contain an invalid value (degrees should be between -90 and 90) or have (0, 0) coordinates.

SQL

SELECT COUNT(*) FROM <schemaname>.<nyctaxi_trip>
WHERE pickup_datetime BETWEEN '20130101' AND '20130331'
AND (CAST(pickup_longitude AS float) NOT BETWEEN -90 AND 90
OR CAST(pickup_latitude AS float) NOT BETWEEN -90 AND 90
OR CAST(dropoff_longitude AS float) NOT BETWEEN -90 AND 90
OR CAST(dropoff_latitude AS float) NOT BETWEEN -90 AND 90
OR (pickup_longitude = '0' AND pickup_latitude = '0')
OR (dropoff_longitude = '0' AND dropoff_latitude = '0'))

Output: The query returns 837,467 trips that have invalid longitude and/or latitude fields.

Exploration: Tipped vs. not tipped trips distribution


This example finds the number of trips that were tipped vs. the number that were not
tipped in a specified time period (or in the full dataset if covering the full year as it is set up
here). This distribution reflects the binary label distribution to be later used for binary
classification modeling.

SQL

SELECT tipped, COUNT(*) AS tip_freq FROM (
SELECT CASE WHEN (tip_amount > 0) THEN 1 ELSE 0 END AS tipped, tip_amount
FROM <schemaname>.<nyctaxi_fare>
WHERE pickup_datetime BETWEEN '20130101' AND '20131231') tc
GROUP BY tipped

Output: The query should return the following tip frequencies for the year 2013: 90,447,622 tipped and 82,264,709 not-tipped.

Exploration: Tip class/range distribution


This example computes the distribution of tip ranges in a given time period (or in the full
dataset if covering the full year). This distribution of label classes will be used later for
multiclass classification modeling.

SQL

SELECT tip_class, COUNT(*) AS tip_freq FROM (
SELECT CASE
WHEN (tip_amount = 0) THEN 0
WHEN (tip_amount > 0 AND tip_amount <= 5) THEN 1
WHEN (tip_amount > 5 AND tip_amount <= 10) THEN 2
WHEN (tip_amount > 10 AND tip_amount <= 20) THEN 3
ELSE 4
END AS tip_class
FROM <schemaname>.<nyctaxi_fare>
WHERE pickup_datetime BETWEEN '20130101' AND '20131231') tc
GROUP BY tip_class

Output:

tip_class   tip_freq
1           82230915
2           6198803
3           1932223
0           82264625
4           85765

Exploration: Compute and compare trip distance


This example converts the pickup and dropoff longitude and latitude to SQL geography
points, computes the trip distance using SQL geography points difference, and returns a
random sample of the results for comparison. The example limits the results to valid
coordinates only using the data quality assessment query covered earlier.

SQL

/****** Object: UserDefinedFunction [dbo].[fnCalculateDistance] ******/


SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

IF EXISTS (SELECT * FROM sys.objects WHERE type IN ('FN', 'IF') AND name =
'fnCalculateDistance')
DROP FUNCTION fnCalculateDistance
GO

-- User-defined function to calculate the direct distance in miles between two geographical coordinates.
CREATE FUNCTION [dbo].[fnCalculateDistance] (@Lat1 float, @Long1 float, @Lat2 float, @Long2 float)

RETURNS float
AS
BEGIN
DECLARE @distance decimal(28, 10)
-- Convert to radians
SET @Lat1 = @Lat1 / 57.2958
SET @Long1 = @Long1 / 57.2958
SET @Lat2 = @Lat2 / 57.2958
SET @Long2 = @Long2 / 57.2958
-- Calculate distance
SET @distance = (SIN(@Lat1) * SIN(@Lat2)) + (COS(@Lat1) * COS(@Lat2) *
COS(@Long2 - @Long1))
--Convert to miles
IF @distance <> 0
BEGIN
SET @distance = 3958.75 * ATAN(SQRT(1 - POWER(@distance, 2)) /
@distance);
END
RETURN @distance
END
GO

SELECT pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude,
dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) AS DirectDistance
FROM <schemaname>.<nyctaxi_trip>
WHERE datepart("mi",pickup_datetime)=1
AND CAST(pickup_latitude AS float) BETWEEN -90 AND 90
AND CAST(dropoff_latitude AS float) BETWEEN -90 AND 90
AND pickup_longitude != '0' AND dropoff_longitude != '0'

Feature engineering using SQL functions


Sometimes SQL functions can be an efficient option for feature engineering. In this
walkthrough, we defined a SQL function to calculate the direct distance between the pickup
and dropoff locations. You can run the following SQL scripts in Visual Studio Data Tools.

Here is the SQL script that defines the distance function.


SQL

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

IF EXISTS (SELECT * FROM sys.objects WHERE type IN ('FN', 'IF') AND name =
'fnCalculateDistance')
DROP FUNCTION fnCalculateDistance
GO

-- User-defined function to calculate the direct distance between two geographical coordinates.
CREATE FUNCTION [dbo].[fnCalculateDistance] (@Lat1 float, @Long1 float, @Lat2 float, @Long2 float)

RETURNS float
AS
BEGIN
DECLARE @distance decimal(28, 10)
-- Convert to radians
SET @Lat1 = @Lat1 / 57.2958
SET @Long1 = @Long1 / 57.2958
SET @Lat2 = @Lat2 / 57.2958
SET @Long2 = @Long2 / 57.2958
-- Calculate distance
SET @distance = (SIN(@Lat1) * SIN(@Lat2)) + (COS(@Lat1) * COS(@Lat2) *
COS(@Long2 - @Long1))
--Convert to miles
IF @distance <> 0
BEGIN
SET @distance = 3958.75 * ATAN(SQRT(1 - POWER(@distance, 2)) /
@distance);
END
RETURN @distance
END
GO

Here is an example of calling this function to generate features in your SQL query:

SQL

-- Sample query to call the function to create features
SELECT pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude,
dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) AS DirectDistance
FROM <schemaname>.<nyctaxi_trip>
WHERE datepart("mi",pickup_datetime)=1
AND CAST(pickup_latitude AS float) BETWEEN -90 AND 90
AND CAST(dropoff_latitude AS float) BETWEEN -90 AND 90
AND pickup_longitude != '0' AND dropoff_longitude != '0'
Output: This query generates a table (with 2,803,538 rows) with pickup and dropoff latitudes and longitudes and the corresponding direct distances in miles. Here are the results for the first three rows:

Row   pickup_latitude   pickup_longitude   dropoff_latitude   dropoff_longitude   DirectDistance
1     40.731804         -74.001083         40.736622          -73.988953          0.7169601222
2     40.715794         -74.010635         40.725338          -74.00399           0.7448343721
3     40.761456         -73.999886         40.766544          -73.988228          0.7037227967

Prepare data for model building


The following query joins the nyctaxi_trip and nyctaxi_fare tables, generates a binary
classification label tipped, a multi-class classification label tip_class, and extracts a sample
from the full joined dataset. The sampling is done by retrieving a subset of the trips based
on pickup time. This SQL query may be copied then pasted for direct data ingestion from
the SQL Database instance in Azure. The query excludes records with incorrect (0, 0)
coordinates.

SQL

SELECT t.*, f.payment_type, f.fare_amount, f.surcharge, f.mta_tax,
f.tolls_amount, f.total_amount, f.tip_amount,
CASE WHEN (tip_amount > 0) THEN 1 ELSE 0 END AS tipped,
CASE WHEN (tip_amount = 0) THEN 0
WHEN (tip_amount > 0 AND tip_amount <= 5) THEN 1
WHEN (tip_amount > 5 AND tip_amount <= 10) THEN 2
WHEN (tip_amount > 10 AND tip_amount <= 20) THEN 3
ELSE 4
END AS tip_class
FROM <schemaname>.<nyctaxi_trip> t, <schemaname>.<nyctaxi_fare> f
WHERE datepart("mi",t.pickup_datetime) = 1
AND t.medallion = f.medallion
AND t.hack_license = f.hack_license
AND t.pickup_datetime = f.pickup_datetime
AND pickup_longitude != '0' AND dropoff_longitude != '0'

When you are ready to proceed to Azure Machine Learning, you may either:

1. Save the final SQL query to extract and sample the data and copy-paste the query
directly into a notebook in Azure Machine Learning, or
2. Persist the sampled and engineered data you plan to use for model building in a new
Azure Synapse Analytics table and access that table through a datastore in Azure
Machine Learning.
Data exploration and feature engineering in the
notebook
In this section, we will perform data exploration and feature generation using both Python
and SQL queries against the Azure Synapse Analytics created earlier. A sample notebook
named SQLDW_Explorations.ipynb and a Python script file SQLDW_Explorations_Scripts.py have been downloaded to your local directory. They are also available on GitHub . The two files contain identical Python code. The Python script file is provided in case you would like to use Python without a notebook. These two sample Python files were designed under Python 2.7.

The needed Azure Synapse Analytics information in the sample Jupyter Notebook and the
Python script file downloaded to your local machine has been plugged in by the PowerShell
script previously. They are executable without any modification.

If you have already set up an Azure Machine Learning workspace, you can directly upload
the sample Notebook to the AzureML Notebooks area. For directions on uploading a
notebook, see Run Jupyter Notebooks in your workspace

Note: In order to run the sample Jupyter Notebook or the Python script file, the following
Python packages are needed.

pandas
numpy
matplotlib
pyodbc
PyTables

When building advanced analytical solutions on Azure Machine Learning with large data,
here is the recommended sequence:

Read in a small sample of the data into an in-memory data frame.


Perform some visualizations and explorations using the sampled data.
Experiment with feature engineering using the sampled data.
For larger data exploration, data manipulation and feature engineering, use Python to
issue SQL Queries directly against the Azure Synapse Analytics.
Decide the sample size to be suitable for Azure Machine Learning model building.

The following are a few data exploration, data visualization, and feature engineering
examples. More data explorations can be found in the sample Jupyter Notebook and the
sample Python script file.

Initialize database credentials


Initialize your database connection settings in the following variables:

SQL

SERVER_NAME=<server name>
DATABASE_NAME=<database name>
USERID=<user name>
PASSWORD=<password>
DB_DRIVER = <database driver>

Create database connection


Here is the connection string that creates the connection to the database.

SQL

CONNECTION_STRING = 'DRIVER={'+DB_DRIVER+'};SERVER='+SERVER_NAME+';DATABASE='+DATABASE_NAME+';UID='+USERID+';PWD='+PASSWORD
conn = pyodbc.connect(CONNECTION_STRING)

Report number of rows and columns in table <nyctaxi_trip>


SQL

nrows = pd.read_sql('''
SELECT SUM(rows) FROM sys.partitions
WHERE object_id = OBJECT_ID('<schemaname>.<nyctaxi_trip>')
''', conn)

print 'Total number of rows = %d' % nrows.iloc[0,0]

ncols = pd.read_sql('''
SELECT COUNT(*) FROM information_schema.columns
WHERE table_name = ('<nyctaxi_trip>') AND table_schema = ('<schemaname>')
''', conn)

print 'Total number of columns = %d' % ncols.iloc[0,0]

Total number of rows = 173179759


Total number of columns = 14

Report number of rows and columns in table


<nyctaxi_fare>
SQL
nrows = pd.read_sql('''
SELECT SUM(rows) FROM sys.partitions
WHERE object_id = OBJECT_ID('<schemaname>.<nyctaxi_fare>')
''', conn)

print 'Total number of rows = %d' % nrows.iloc[0,0]

ncols = pd.read_sql('''
SELECT COUNT(*) FROM information_schema.columns
WHERE table_name = ('<nyctaxi_fare>') AND table_schema = ('<schemaname>')
''', conn)

print 'Total number of columns = %d' % ncols.iloc[0,0]

Total number of rows = 173179759


Total number of columns = 11

Read-in a small data sample from the Azure Synapse


Analytics Database
SQL

t0 = time.time()

query = '''
SELECT TOP 10000 t.*, f.payment_type, f.fare_amount, f.surcharge,
f.mta_tax,
f.tolls_amount, f.total_amount, f.tip_amount
FROM <schemaname>.<nyctaxi_trip> t, <schemaname>.<nyctaxi_fare> f
WHERE datepart("mi",t.pickup_datetime) = 1
AND t.medallion = f.medallion
AND t.hack_license = f.hack_license
AND t.pickup_datetime = f.pickup_datetime
'''

df1 = pd.read_sql(query, conn)

t1 = time.time()
print 'Time to read the sample table is %f seconds' % (t1-t0)

print 'Number of rows and columns retrieved = (%d, %d)' % (df1.shape[0], df1.shape[1])

Time to read the sample table is 14.096495 seconds. Number of rows and columns
retrieved = (1000, 21).

Descriptive statistics
Now you are ready to explore the sampled data. We start by looking at some descriptive statistics for the trip_distance (or any other fields you choose to specify).

SQL

df1['trip_distance'].describe()

Visualization: Box plot example


Next we look at the box plot for the trip distance to visualize the quantiles.

SQL

df1.boxplot(column='trip_distance',return_type='dict')

Visualization: Distribution plot example


The following plots visualize the distribution and a histogram of the sampled trip distances.

SQL

fig = plt.figure()
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
df1['trip_distance'].plot(ax=ax1,kind='kde', style='b-')
df1['trip_distance'].hist(ax=ax2, bins=100, color='k')
Visualization: Bar and line plots
In this example, we bin the trip distance into five bins and visualize the binning results.

SQL

trip_dist_bins = [0, 1, 2, 4, 10, 1000]


df1['trip_distance']
trip_dist_bin_id = pd.cut(df1['trip_distance'], trip_dist_bins)
trip_dist_bin_id

We can plot the above bin distribution in a bar or line plot with:

Python

pd.Series(trip_dist_bin_id).value_counts().plot(kind='bar')

and

Python

pd.Series(trip_dist_bin_id).value_counts().plot(kind='line')

Visualization: Scatterplot examples


We show a scatter plot of trip_time_in_secs versus trip_distance to see whether there is
any correlation.

Python

plt.scatter(df1['trip_time_in_secs'], df1['trip_distance'])

Similarly, we can check the relationship between passenger_count and trip_distance.

Python

plt.scatter(df1['passenger_count'], df1['trip_distance'])

Data exploration on sampled data using SQL queries in
Jupyter notebook
In this section, we explore data distributions using the sampled data that is persisted in the
new table we created above. Similar explorations may be performed using the original
tables.

Exploration: Report number of rows and columns in the sampled table

Python

nrows = pd.read_sql('''SELECT SUM(rows) FROM sys.partitions WHERE object_id =
OBJECT_ID('<schemaname>.<nyctaxi_sample>')''', conn)
print('Number of rows in sample = %d' % nrows.iloc[0,0])

ncols = pd.read_sql('''SELECT COUNT(*) FROM information_schema.columns WHERE
table_name = ('<nyctaxi_sample>') AND table_schema = '<schemaname>' ''', conn)
print('Number of columns in sample = %d' % ncols.iloc[0,0])

Exploration: Tipped/not tipped distribution

Python

query = '''
SELECT tipped, count(*) AS tip_freq
FROM <schemaname>.<nyctaxi_sample>
GROUP BY tipped
'''

pd.read_sql(query, conn)

Exploration: Tip class distribution

Python

query = '''
SELECT tip_class, count(*) AS tip_freq
FROM <schemaname>.<nyctaxi_sample>
GROUP BY tip_class
'''

tip_class_dist = pd.read_sql(query, conn)

Exploration: Plot the tip distribution by class

Python

tip_class_dist['tip_freq'].plot(kind='bar')

Exploration: Daily distribution of trips

Python

query = '''
SELECT CONVERT(date, dropoff_datetime) AS date, COUNT(*) AS c
FROM <schemaname>.<nyctaxi_sample>
GROUP BY CONVERT(date, dropoff_datetime)
'''

pd.read_sql(query,conn)

Exploration: Trip distribution per medallion

Python
query = '''
SELECT medallion,count(*) AS c
FROM <schemaname>.<nyctaxi_sample>
GROUP BY medallion
'''

pd.read_sql(query,conn)

Exploration: Trip distribution by medallion and hack license

Python

query = '''select medallion, hack_license, count(*) from <schemaname>.<nyctaxi_sample>
group by medallion, hack_license'''
pd.read_sql(query,conn)

Exploration: Trip time distribution

Python

query = '''select trip_time_in_secs, count(*) from <schemaname>.<nyctaxi_sample>
group by trip_time_in_secs order by count(*) desc'''
pd.read_sql(query,conn)

Exploration: Trip distance distribution

Python

query = '''select floor(trip_distance/5)*5 as tripbin, count(*) from
<schemaname>.<nyctaxi_sample> group by floor(trip_distance/5)*5 order by
count(*) desc'''
pd.read_sql(query,conn)

Exploration: Payment type distribution

Python

query = '''select payment_type, count(*) from <schemaname>.<nyctaxi_sample>
group by payment_type'''
pd.read_sql(query,conn)

Verify the final form of the featurized table

Python

query = '''SELECT TOP 100 * FROM <schemaname>.<nyctaxi_sample>'''
pd.read_sql(query,conn)

Build models in Azure Machine Learning


We are now ready to proceed to model building and model deployment in Azure Machine
Learning. The data is ready to be used in any of the prediction problems identified earlier,
namely:

1. Binary classification: To predict whether or not a tip was paid for a trip.
2. Multiclass classification: To predict the range of tip paid, according to the previously
defined classes.
3. Regression task: To predict the amount of tip paid for a trip.

To begin the modeling exercise, log in to your Azure Machine Learning workspace. If you
have not yet created a machine learning workspace, see Create the workspace.

1. To get started with Azure Machine Learning, see What is Azure Machine Learning?
2. Sign in to the Azure portal .
3. The Machine Learning Home page provides a wealth of information, videos, tutorials,
links to the Modules Reference, and other resources. For more information about
Azure Machine Learning, see the Azure Machine Learning Documentation Center.

A typical training experiment consists of the following steps:

1. Create an authoring resource (notebook or designer, for example).
2. Get the data into Azure Machine Learning.
3. Pre-process, transform, and manipulate the data as needed.
4. Generate features as needed.
5. Split the data into training, validation, and testing datasets (or have separate datasets
for each).
6. Select one or more machine learning algorithms depending on the learning problem
to solve. For example, binary classification, multiclass classification, regression.
7. Train one or more models using the training dataset.
8. Score the validation dataset using the trained model(s).
9. Evaluate the model(s) to compute the relevant metrics for the learning problem.
10. Tune the model(s) and select the best model to deploy.

In this exercise, we have already explored and engineered the data in Azure Synapse
Analytics, and decided on the sample size to ingest in Azure Machine Learning. Here is the
procedure to build one or more of the prediction models:
1. Get the data into Azure Machine Learning. For details, see Data ingestion options for
Azure Machine Learning workflows
2. Connect to Synapse Analytics. For details, see Link Azure Synapse Analytics and Azure
Machine Learning workspaces and attach Apache Spark pools

Important

In the modeling data extraction and sampling query examples provided in previous
sections, all labels for the three modeling exercises are included in the query. An
important (required) step in each of the modeling exercises is to exclude the
unnecessary labels for the other two problems, and any other target leaks. For
example, when using binary classification, use the label tipped and exclude the fields
tip_class, tip_amount, and total_amount. The latter are target leaks since they imply
the tip paid.
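
As a minimal local sketch of steps 5 through 9 and of the label exclusion described in the
preceding note, the following code trains a simple binary classifier for the tipped label with
scikit-learn. The df_sample DataFrame, the use of logistic regression, and the train/test split
values are illustrative assumptions, not the Azure Machine Learning workflow itself.

Python

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# df_sample is assumed to be the featurized sample loaded into pandas, for example:
# df_sample = pd.read_sql('SELECT * FROM <schemaname>.<nyctaxi_sample>', conn)

# Exclude the labels of the other two tasks and the other target leaks, as described above.
leak_columns = ['tip_class', 'tip_amount', 'total_amount']
features = df_sample.drop(columns=leak_columns + ['tipped']).select_dtypes(include='number')
label = df_sample['tipped']

X_train, X_test, y_train, y_test = train_test_split(
    features, label, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('Holdout accuracy = %.3f' % model.score(X_test, y_test))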

Deploy models in Azure Machine Learning


When your model is ready, you can easily deploy it as a web service directly from the model
stored in the experiment run. For details on the options for deployment, see Deploy
machine learning models to Azure

Summary
In this walkthrough tutorial, you created an Azure data science environment and worked
with a large public dataset, taking it through the Team Data Science Process all the way
from data acquisition to model training and the deployment of an Azure Machine Learning
web service.

License information
This sample walkthrough and its accompanying scripts and notebook(s) are shared by
Microsoft under the MIT license. Check the LICENSE.txt file in the directory of the sample
code on GitHub for more details.

References
NYC Taxi and Limousine Commission Research and Statistics
Search and query an enterprise
knowledge base by using Azure
OpenAI or Azure Cognitive Search
Azure Blob Storage Azure Cache for Redis Azure Cognitive Search Azure AI services

Azure Document Intelligence

This article describes how to use Azure OpenAI Service or Azure Cognitive Search to
search documents in your enterprise data and retrieve results to provide a ChatGPT-
style question and answer experience. This solution describes two approaches:

Embeddings approach: Use the Azure OpenAI embedding model to create vectorized
data. Vector search is a feature that significantly increases the semantic relevance of
search results.

Azure Cognitive Search approach: Use Azure Cognitive Search to search and
retrieve relevant text data based on a user query. This service supports full-text
search, semantic search, vector search, and hybrid search.

Note

In Azure Cognitive Search, the semantic search and vector search features are
currently in public preview.

Architecture: Embedding approach


[Architecture diagram. Embedding creation: documents in storage accounts are processed
by function apps that extract text, optionally translate it by using Azure Translator and
Azure AI Document Intelligence, and create embeddings with the Azure OpenAI embedding
model, which are stored in Azure Cache for Redis. Query and retrieval: a user query from
Azure App Service is vectorized by the Azure OpenAI embedding model, the top k matching
content is returned from the vector database, and the results are passed with a prompt to
the Azure OpenAI language model.]

Download a Visio file of this architecture.
Dataflow
Documents to be ingested can come from various sources, like files on an FTP server,
email attachments, or web application attachments. These documents can be ingested
to Azure Blob Storage via services like Azure Logic Apps, Azure Functions, or Azure Data
Factory. Data Factory is optimal for transferring bulk data.

Embedding creation:

1. The document is ingested into Blob Storage, and an Azure function is triggered to
extract text from the documents.

2. If documents are in a non-English language and translation is required, an Azure
   function can call Azure Translator to perform the translation.

3. If the documents are PDFs or images, an Azure function can call Azure AI
Document Intelligence to extract the text. If the document is an Excel, CSV, Word,
or text file, Python code can be used to extract the text.

4. The extracted text is then chunked appropriately, and an Azure OpenAI embedding
model is used to convert each chunk to embeddings.

5. These embeddings are persisted to the vector database. This solution uses the
Enterprise tier of Azure Cache for Redis, but any vector database can be used.

Query and retrieval:

1. The user sends a query via a user application.

2. The Azure OpenAI embedding model is used to convert the query into vector
embeddings.

3. A vector similarity search that uses this query vector in the vector database returns
the top k matching content. The matching content to be retrieved can be set
according to a threshold that’s defined by a similarity measure, like cosine
similarity.

4. The top k retrieved content and the system prompt are sent to the Azure OpenAI
language model, like GPT-3.5 Turbo or GPT-4.

5. The search results are presented as the answer to the search query that was
initiated by the user, or the search results can be used as the grounding data for a
multi-turn conversation scenario.
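
As a rough sketch of steps 2 and 3 above, the following code converts a query into an
embedding by using an Azure OpenAI embedding deployment and ranks stored chunks by
cosine similarity. The deployment name, the in-memory chunk_store structure, and the
helper functions are illustrative assumptions; in this architecture the vectors are stored and
searched in Azure Cache for Redis.

Python

import numpy as np
import openai

# Assumes openai.api_type, api_base, api_key, and api_version are already configured for
# Azure OpenAI, and that chunk_store is a list of (chunk_text, embedding) pairs produced
# by the embedding-creation pipeline. Names and values are illustrative.
EMBEDDING_DEPLOYMENT = "text-embedding-ada-002"

def embed(text):
    response = openai.Embedding.create(engine=EMBEDDING_DEPLOYMENT, input=text)
    return np.array(response["data"][0]["embedding"])

def top_k_chunks(query, chunk_store, k=3, threshold=0.75):
    query_vector = embed(query)
    scored = []
    for chunk_text, vector in chunk_store:
        vector = np.array(vector)
        cosine = float(np.dot(query_vector, vector) /
                       (np.linalg.norm(query_vector) * np.linalg.norm(vector)))
        if cosine >= threshold:  # similarity threshold, as described in step 3
            scored.append((cosine, chunk_text))
    # Return the k chunks with the highest cosine similarity to the query.
    return sorted(scored, reverse=True)[:k]
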
Architecture: Azure Cognitive Search pull approach

[Architecture diagram. Index creation: Azure Cognitive Search uses the pull API to index
documents in storage accounts, with optional AI enrichment skillsets that use Azure
Translator and Azure AI Document Intelligence. Query and retrieval: a user query is sent
from Azure App Service to Azure Cognitive Search, a system prompt is created, and the
Azure OpenAI language model is called.]


Download a Visio file of this architecture.

Index creation:

1. Azure Cognitive Search is used to create a search index of the documents in Blob
Storage. Azure Cognitive Search supports Blob Storage, so the pull model is used
to crawl the content, and the capability is implemented via indexers.

Note

Azure Cognitive Search supports other data sources for indexing when using
the pull model. Documents can also be indexed from multiple data sources
and consolidated into a single index.

2. If certain scenarios require translation of documents, Azure Translator, which is
   available as a built-in skill, can be used.

3. If the documents are nonsearchable, like scanned PDFs or images, AI can be applied
   by using built-in or custom skills as skillsets in Azure Cognitive Search.
Applying AI over content that isn't full-text searchable is called AI enrichment.
Depending on the requirement, Azure AI Document Intelligence can be used as a
custom skill to extract text from PDFs or images via document analysis models,
prebuilt models, or custom extraction models.

If AI enrichment is a requirement, the pull model (indexers) must be used to load an
index.

If vector fields are added to the index schema and vector data is loaded for indexing,
vector search can be enabled on that data. Vector data can be generated via Azure
OpenAI embeddings.
Query and retrieval:

1. A user sends a query via a user application.

2. The query is passed to Azure Cognitive Search via the search documents REST API.
The query type can be simple, which is optimal for full-text search, or full, which is
for advanced query constructs like regular expressions, fuzzy and wild card search,
and proximity search. If the query type is set to semantic, a semantic search is
performed on the documents, and the relevant content is retrieved. Azure
Cognitive Search also supports vector search and hybrid search, which requires the
user query to be converted to vector embeddings.

3. The retrieved content and the system prompt are sent to the Azure OpenAI
language model, like GPT-3.5 Turbo or GPT-4.

4. The search results are presented as the answer to the search query that was
initiated by the user, or the search results can be used as the grounding data for a
multi-turn conversation scenario.
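
As an illustration of step 2, the following sketch queries the index by using the
azure-search-documents Python SDK instead of calling the REST API directly. The endpoint,
index name, key, and the content field name are placeholder assumptions.

Python

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholder values; replace with your search service endpoint, index, and query key.
search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<your-query-key>"),
)

# A simple (full-text) query. Setting query_type="semantic" runs a semantic search instead,
# provided semantic ranking is configured on the index.
results = search_client.search(search_text="warranty terms for product X", top=5)

# "content" is an assumed field name in the index schema.
retrieved_chunks = [doc["content"] for doc in results]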

Architecture: Azure Cognitive Search push approach

If the data source isn't supported, you can use the push model to upload the data to
Azure Cognitive Search.

[Architecture diagram. Index creation: files are optionally translated and text-extracted by
using Azure Translator and Azure AI Document Intelligence, then pushed from Azure App
Service to Azure Cognitive Search via the push API. Query and retrieval: a user query is sent
from Azure App Service to Azure Cognitive Search, a system prompt is created, and the
Azure OpenAI language model is called.]


Download a Visio file of this architecture.

Index creation:

1. If the document to be ingested must be translated, Azure Translator can be used.


2. If the document is in a nonsearchable format, like a PDF or image, Azure AI
Document Intelligence can be used to extract text.
3. The extracted text can be vectorized by using Azure OpenAI embeddings, and the data
   can be pushed to an Azure Cognitive Search index via a REST API or an Azure SDK.
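
The following minimal sketch illustrates the push model, assuming an index whose schema
includes id, content, and contentVector fields. The field names, embedding deployment
name, and keys are assumptions and must match your own index definition.

Python

import openai
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# chunk_text stands in for the translated and extracted text from steps 1 and 2.
chunk_text = "<extracted document text>"

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<your-admin-key>"),
)

# Vectorize the text with an Azure OpenAI embedding deployment (illustrative name).
embedding = openai.Embedding.create(
    engine="text-embedding-ada-002",
    input=chunk_text,
)["data"][0]["embedding"]

# Push the document into the index; field names must match your index schema.
search_client.upload_documents(documents=[{
    "id": "doc-001",
    "content": chunk_text,
    "contentVector": embedding,
}])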

Query and retrieval:

The query and retrieval in this approach is the same as the pull approach earlier in this
article.

Components
Azure OpenAI provides REST API access to Azure OpenAI's language models
including the GPT-3, Codex, and the embedding model series for content
generation, summarization, semantic search, and natural language-to-code
translation. Access the service by using a REST API, Python SDK, or the web-based
interface in the Azure OpenAI Studio .

Azure AI Document Intelligence is an Azure AI service. It offers document analysis
capabilities to extract printed and handwritten text, tables, and key-value
pairs. Azure AI Document Intelligence provides prebuilt models that can extract
data from invoices, documents, receipts, ID cards, and business cards. You can also
use it to train and deploy custom models by using a custom template form model
or a custom neural document model.

Document Intelligence Studio provides a UI for exploring Azure AI Document Intelligence
features and models, and for building, tagging, training, and deploying
custom models.

Azure Cognitive Search is a cloud service that provides infrastructure, APIs, and
tools for searching. Use Azure Cognitive Search to build search experiences over
private disparate content in web, mobile, and enterprise applications.

Blob Storage is the object storage solution for raw files in this scenario. Blob
Storage supports libraries for various languages, such as .NET, Node.js, and Python.
Applications can access files in Blob Storage via HTTP or HTTPS. Blob Storage
has hot, cool, and archive access tiers to support cost optimization for storing large
amounts of data.

The Enterprise tier of Azure Cache for Redis provides managed Redis Enterprise
modules, like RediSearch, RedisBloom, RedisTimeSeries, and RedisJSON. Vector
fields allow vector similarity search, which supports real-time vector indexing
(brute force algorithm (FLAT) and hierarchical navigable small world algorithm
(HNSW)), real-time vector updates, and k-nearest neighbor search. Azure Cache for
Redis brings a critical low-latency and high-throughput data storage solution to
modern applications.
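
For illustration, the following redis-py sketch creates an HNSW vector index and runs a
k-nearest-neighbor query against the Enterprise tier of Azure Cache for Redis. The index
name, field names, key prefix, and the 1,536-dimension embedding size (the size produced
by the text-embedding-ada-002 model) are assumptions for this example.

Python

import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

# Connection details for the Enterprise tier of Azure Cache for Redis (placeholders).
r = redis.Redis(
    host="<your-cache>.<region>.redisenterprise.cache.azure.net",
    port=10000, password="<access-key>", ssl=True,
)

# Create an HNSW vector index over hashes whose keys start with "chunk:".
r.ft("chunks").create_index(
    fields=[
        TextField("content"),
        VectorField("embedding", "HNSW",
                    {"TYPE": "FLOAT32", "DIM": 1536, "DISTANCE_METRIC": "COSINE"}),
    ],
    definition=IndexDefinition(prefix=["chunk:"], index_type=IndexType.HASH),
)

# k-nearest-neighbor search for the three chunks closest to a query embedding.
# query_embedding is assumed to be the user query vector (a list of 1,536 floats).
query_embedding = [0.0] * 1536
query_vector = np.asarray(query_embedding, dtype=np.float32).tobytes()
knn = (Query("*=>[KNN 3 @embedding $vec AS score]")
       .sort_by("score")
       .return_fields("content", "score")
       .dialect(2))
results = r.ft("chunks").search(knn, query_params={"vec": query_vector})
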
Alternatives
Depending on your scenario, you can add the following workflows.

Use the Azure AI Language features, question answering and conversational language
understanding, to build a natural conversational layer over your data.
These features find appropriate answers for the input from your custom
knowledge base of information.

To create vectorized data, you can use any embedding model. You can also use the
Azure AI services Vision image retrieval API to vectorize images. This tool is
available in private preview.

Use the Durable Functions extension for Azure Functions as a code-first integration
tool to perform text-processing steps, like reading handwriting, text, and tables,
and processing language to extract entities on data based on the size and scale of
the workload.

You can use any database for persistent storage of the extracted embeddings,
including:
Azure SQL Database
Azure Cosmos DB
Azure Database for PostgreSQL
Azure Database for MySQL

Scenario details
Manual processing is increasingly time-consuming, error-prone, and resource-intensive
due to the sheer volume of documents. Organizations that handle huge volumes of
documents, largely unstructured data of different formats like PDF, Excel, CSV, Word,
PowerPoint, and image formats, face a significant challenge processing scanned and
handwritten documents and forms from their customers.

These documents and forms contain critical information, such as personal details,
medical history, and damage assessment reports, which must be accurately extracted
and processed.

Organizations often already have their own knowledge base of information, which can
be used for answering questions with the most appropriate answer. You can use the
services and pipelines described in these solutions to create a source for search
mechanisms of documents.
Potential use cases
This solution provides value to organizations in industries like pharmaceutical
companies and financial services. It applies to any company that has a large number of
documents with embedded information. This AI-powered end-to-end search solution
can be used to extract meaningful information from the documents based on the user
query to provide a ChatGPT-style question and answer experience.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Dixit Arora | Senior Customer Engineer, ISV DN CoE


Jyotsna Ravi | Principal Customer Engineer, ISV DN CoE

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
What is Azure AI Document Intelligence?
What is Azure OpenAI?
What is Azure Machine Learning?
Introduction to Blob Storage
What is Azure AI Language?
Introduction to Azure Data Lake Storage Gen2
Azure QnA Maker client library
Create, train, and publish your QnA Maker knowledge base
What is question answering?

Related resources
Query-based document summarization
Automate document identification, classification, and search by using Durable
Functions
Index file content and metadata by using Azure Cognitive Search
AI enrichment with image and text processing
Extract and analyze call center
data
Azure Blob Storage Azure AI Speech Azure AI services Power BI

This article describes how to extract insights from customer conversations at a call
center by using Azure AI services and Azure OpenAI Service. Use these real-time and
post-call analytics to improve call center efficiency and customer satisfaction.

Architecture

[Architecture diagram. Intelligent transcription: audio files from person-to-person
conversations are uploaded from a telephony server to Azure Blob Storage, transcribed by
Azure AI speech to text, and enriched by Azure AI Language and Azure OpenAI Service to
extract insights from the call transcripts. Interact and visualize: near real-time insights are
surfaced in Power BI, a web app, or a CRM, including detailed call history, summaries, and
reasons for calling.]


Download a PowerPoint file of this architecture.

Dataflow
1. A phone call between an agent and a customer is recorded and stored in Azure
Blob Storage. Audio files are uploaded to an Azure Storage account via a
supported method, such as the UI-based tool, Azure Storage Explorer, or a Storage
SDK or API.

2. Azure AI Speech is used to transcribe audio files in batch mode asynchronously with
   speaker diarization enabled. The transcription results are persisted in Blob Storage.
3. Azure AI Language is used to detect and redact personal data in the transcript.

For batch mode transcription and personal data detection and redaction, use the
AI services Ingestion Client tool. The Ingestion Client tool uses a no-code approach
for call center transcription.

4. Azure OpenAI is used to process the transcript and extract entities, summarize the
conversation, and analyze sentiments. The processed output is stored in Blob
Storage and then analyzed and visualized by using other services. You can also
store the output in a datastore for keeping track of metadata and for reporting.
Use Azure OpenAI to process the stored transcription information.

5. Power BI or a custom web application that's hosted by App Service is used to visualize
   the output. Both options provide near real-time insights. You can store
this output in a CRM, so agents have contextual information about why the
customer called and can quickly solve potential problems. This process is fully
automated, which saves the agents time and effort.
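
As a rough code-first sketch of steps 3 and 4 (an alternative to the no-code Ingestion Client
tool), the following code redacts personal data from a transcript by using the Azure AI
Language PII detection API and then asks an Azure OpenAI chat deployment to summarize
the redacted conversation. The endpoint, keys, deployment name, and prompt are
placeholder assumptions.

Python

import openai
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# transcript stands in for the diarized text produced by batch transcription in step 2.
transcript = "<diarized call transcript from step 2>"

language_client = TextAnalyticsClient(
    endpoint="https://<your-language-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-language-key>"),
)

# Step 3: detect and redact personal data in the transcript.
pii_result = language_client.recognize_pii_entities([transcript])[0]
redacted_transcript = pii_result.redacted_text

# Step 4: summarize the redacted conversation and analyze sentiment with Azure OpenAI.
# Assumes openai.api_type, api_base, api_key, and api_version are configured for Azure OpenAI.
response = openai.ChatCompletion.create(
    engine="<your-gpt-deployment>",  # for example, a GPT-3.5 Turbo or GPT-4 deployment
    messages=[
        {"role": "system", "content": "Summarize the call, list the reason for calling, "
                                      "and state the overall customer sentiment."},
        {"role": "user", "content": redacted_transcript},
    ],
    temperature=0.2,
)
print(response["choices"][0]["message"]["content"])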

Components
Blob Storage is the object storage solution for raw files in this scenario. Blob
Storage supports libraries for languages like .NET, Node.js, and Python.
Applications can access files on Blob Storage via HTTP or HTTPS. Blob Storage has
hot, cool, and archive access tiers for storing large amounts of data, which
optimizes cost.

Azure OpenAI provides access to the Azure OpenAI language models, including
GPT-3, Codex, and the embeddings model series, for content generation,
summarization, semantic search, and natural language-to-code translation. You
can access the service through REST APIs, Python SDK, or the web-based interface
in the Azure OpenAI Studio .

Azure AI Speech is an AI-based API that provides speech capabilities like speech-
to-text, text-to-speech, speech translation, and speaker recognition. This
architecture uses the Azure AI Speech batch transcription functionality.

Azure AI Language consolidates the Azure natural-language processing services.


For information about prebuilt and customizable options, see Azure AI Language
available features.

Language Studio provides a UI for exploring and analyzing AI services for language
features. Language Studio provides options for building, tagging,
training, and deploying custom models.
Power BI is a software-as-a-service (SaaS) that provides visual and interactive
insights for business analytics. It provides transformation capabilities and connects
to other data sources.

Alternatives
Depending on your scenario, you can add the following workflows.

Perform conversation summarization by using the prebuilt model in Azure AI Language.
Depending on the size and scale of your workload, you can use Azure Functions as
a code-first integration tool to perform text-processing steps, like text
summarization on extracted data.
Deploy and implement a custom speech-to-text solution.

Scenario details
This solution uses Azure AI Speech to convert audio into written text. Azure AI Language
redacts sensitive information in the conversation transcription. Azure OpenAI extracts
insights from customer conversation to improve call center efficiency and customer
satisfaction. Use this solution to process transcribed text, recognize and remove
sensitive information, and perform sentiment analysis. Scale the services and the
pipeline to accommodate any volume of recorded data.

Potential use cases


This solution provides value to organizations in industries like telecommunications and
financial services. It applies to any organization that records conversations. Customer-
facing or internal call centers or support desks benefit from using this solution.

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Reliability
Reliability ensures your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.
Find the availability service level agreement (SLA) for each component in SLAs for
online services .
To design high-availability applications with Storage accounts, see the
configuration options.
To ensure resiliency of the compute services and datastores in this scenario, use
failure mode for services like Azure Functions and Storage. For more information,
see the resiliency checklist for Azure services.
Back up and recover your Form Recognizer models.

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

Implement data protection, identity and access management, and network security
recommendations for Blob Storage, AI services, and Azure OpenAI.
Configure AI services virtual networks.

Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.

The total cost of this solution depends on the pricing tier of your services. Factors that
can affect the price of each component are:

The number of documents that you process.


The number of concurrent requests that your application receives.
The size of the data that you store after processing.
Your deployment region.

For more information, see the following resources:

Azure OpenAI pricing


Blob Storage pricing
Azure AI Language pricing
Azure Machine Learning pricing

Use the Azure pricing calculator to estimate your solution cost.

Performance efficiency
Performance efficiency is the ability of your workload to meet the demands placed on it
by users in an efficient manner. For more information, see Overview of the performance
efficiency pillar.

When high volumes of data are processed, it can expose performance bottlenecks. To
ensure proper performance efficiency, understand and plan for the scaling options to
use with the AI services autoscale feature.

The batch speech API is designed for high volumes, but other AI services APIs might
have request limits, depending on the subscription tier. Consider containerizing AI
services APIs to avoid slowing down large-volume processing. Containers provide
deployment flexibility in the cloud and on-premises. Mitigate side effects of new version
rollouts by using containers. For more information, see Container support in AI services.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Dixit Arora | Senior Customer Engineer, ISV DN CoE


Jyotsna Ravi | Principal Customer Engineer, ISV DN CoE

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
What is Azure AI Speech?
What is Azure OpenAI?
What is Azure Machine Learning?
Introduction to Blob Storage
What is Azure AI Language?
Introduction to Azure Data Lake Storage Gen2
What is Power BI?
Ingestion Client with AI services
Post-call transcription and analytics

Related resources
Use a speech-to-text transcription pipeline to analyze recorded conversations
Deploy a custom speech-to-text solution
Create custom language and acoustic models
Deploy a custom speech-to-text solution
Implement logging and
monitoring for Azure OpenAI
models
Azure AI services Azure API Management Azure Monitor Azure Active Directory

This solution provides comprehensive logging and monitoring and enhanced security
for enterprise deployments of the Azure OpenAI Service API. The solution enables
advanced logging capabilities for tracking API usage and performance and robust
security measures to help protect sensitive data and help prevent malicious activity.

Architecture

Download a Visio file of this architecture.

Workflow
1. Client applications access Azure OpenAI endpoints to perform text generation
(completions) and model training (fine-tuning).

2. Azure Application Gateway provides a single point of entry to Azure OpenAI models
   and provides load balancing for APIs.

Note
Load balancing of stateful operations like model fine-tuning, deployments,
and inference of fine-tuned models isn't supported.

3. Azure API Management enables security controls and auditing and monitoring of
the Azure OpenAI models.
a. In API Management, enhanced-security access is granted via Microsoft Entra
groups with subscription-based access permissions.
b. Auditing is enabled for all interactions with the models via Azure Monitor
request logging.
c. Monitoring provides detailed Azure OpenAI model usage KPIs and metrics,
including prompt information and token statistics for usage traceability.

4. API Management connects to all Azure resources via Azure Private Link. This
configuration provides enhanced security for all traffic via private endpoints and
contains traffic in the private network.

5. Multiple Azure OpenAI instances enable scale-out of API usage to ensure high
availability and disaster recovery for the service.
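
For example, a client application in step 1 might call a completions endpoint through the
API Management gateway so that the request is captured by the logging described in step 3.
The gateway URL, deployment name, API version, and key below are placeholders, and the
subscription key header shown is the API Management default; the exact header depends
on how your APIs are configured.

Python

import requests

APIM_GATEWAY = "https://<your-apim-instance>.azure-api.net"
url = (f"{APIM_GATEWAY}/openai/deployments/<your-deployment>"
       f"/completions?api-version=2023-05-15")

headers = {
    "Ocp-Apim-Subscription-Key": "<your-apim-subscription-key>",
    "Content-Type": "application/json",
}
payload = {
    "prompt": "Summarize the benefits of centralized logging for Azure OpenAI.",
    "max_tokens": 100,
}

# The gateway forwards the call to Azure OpenAI and logs the request and response.
response = requests.post(url, headers=headers, json=payload)
print(response.json()["choices"][0]["text"])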

Components
Application Gateway . Application load balancer to help ensure that all users of
the Azure OpenAI APIs get the fastest response and highest throughput for model
completions.
API Management . API management platform for accessing back-end Azure
OpenAI endpoints. Provides monitoring and logging that's not available natively in
Azure OpenAI.
Azure Virtual Network . Private network infrastructure in the cloud. Provides
network isolation so that all network traffic for models is routed privately to Azure
OpenAI.
Azure OpenAI . Service that hosts models and provides generative model
completion outputs.
Monitor . End-to-end observability for applications. Provides access to
application logs via Kusto Query Language. Also enables dashboard reports and
monitoring and alerting capabilities.
Azure Key Vault . Enhanced-security storage for keys and secrets that are used by
applications.
Azure Storage . Application storage in the cloud. Provides Azure OpenAI with
accessibility to model training artifacts.
Microsoft Entra ID . Enhanced-security identity manager. Enables user
authentication and authorization to the application and to platform services that
support the application. Also provides Group Policy to ensure that the principle of
least privilege is applied to all users.

Alternatives
Azure OpenAI provides native logging and monitoring. You can use this native
functionality to track telemetry of the service, but the default cognitive service logging
doesn't track or record inputs and outputs of the service, like prompts, tokens, and
models. These metrics are especially important for compliance and to ensure that the
service operates as expected. Also, by tracking interactions with the large language
models deployed to Azure OpenAI, you can analyze how your organization is using the
service to identify cost and usage patterns that can help inform decisions on scaling and
resource allocation.

The following table provides a comparison of the metrics provided by the default Azure
OpenAI logging and those provided by this solution.

| Metric                            | Default Azure OpenAI logging | This solution                            |
| --------------------------------- | ---------------------------- | ---------------------------------------- |
| Request count                     | x                            | x                                        |
| Data in (size) / data out (size)  | x                            | x                                        |
| Latency                           | x                            | x                                        |
| Token transactions (total)        | x                            | x                                        |
| Caller IP address                 | x (last octet masked)        | x                                        |
| Model utilization                 |                              | x                                        |
| Token utilization (input/output)  | x                            | x                                        |
| Input prompt detail               |                              | x (limited to 8,192 response characters) |
| Output completion detail          |                              | x (limited to 8,192 response characters) |
| Deployment operations             | x                            | x                                        |
| Embedding operations              | x                            | x (limited to 8,192 response characters) |

Scenario details
Large enterprises that use generative AI models need to implement auditing and
logging of the use of these models to ensure responsible use and corporate compliance.
This solution provides enterprise-level logging and monitoring for all interactions with
AI models to mitigate harmful use of the models and help ensure that security and
compliance standards are met. The solution integrates with existing APIs for Azure
OpenAI with little modification to take advantage of existing code bases. Administrators
can also monitor service usage for reporting.

The solution provides these advantages:

Comprehensive logging of Azure OpenAI model execution, tracked to the source IP
address. Log information includes text that users submit to the model and text
received back from the model. This logging helps ensure that models are used
responsibly and within the approved use cases of the service.
High availability of the model APIs to ensure that user requests are met even if the
traffic exceeds the limits of a single Azure OpenAI service.
Role-based access managed via Microsoft Entra ID to ensure that the principle of
least privilege is applied.

Example query for usage monitoring

ApiManagementGatewayLogs
| where OperationId == 'completions_create'
| extend modelkey = substring(parse_json(BackendResponseBody)['model'], 0,
indexof(parse_json(BackendResponseBody)['model'], '-', 0, -1, 2))
| extend model = tostring(parse_json(BackendResponseBody)['model'])
| extend prompttokens = parse_json(parse_json(BackendResponseBody)['usage'])
['prompt_tokens']
| extend completiontokens = parse_json(parse_json(BackendResponseBody)
['usage'])['completion_tokens']
| extend totaltokens = parse_json(parse_json(BackendResponseBody)['usage'])
['total_tokens']
| extend ip = CallerIpAddress
| summarize
sum(todecimal(prompttokens)),
sum(todecimal(completiontokens)),
sum(todecimal(totaltokens)),
avg(todecimal(totaltokens))
by ip, model

Output:

Example query for prompt usage monitoring

ApiManagementGatewayLogs
| where OperationId == 'completions_create'
| extend model = tostring(parse_json(BackendResponseBody)['model'])
| extend prompttokens = parse_json(parse_json(BackendResponseBody)['usage'])
['prompt_tokens']
| extend prompttext = substring(parse_json(parse_json(BackendResponseBody)
['choices'])[0], 0, 100)

Output:

Potential use cases


Deployment of Azure OpenAI for internal enterprise users to accelerate
productivity
High availability of Azure OpenAI for internal applications
Enhanced-security use of Azure OpenAI within regulated industries

Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that you can use to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.

Reliability
Reliability ensures that your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.

This scenario ensures high availability of the large language models for your enterprise
users. The Azure application gateway provides an effective layer-7 application delivery
mechanism to ensure fast and consistent access to applications. You can use API
Management to configure, manage, and monitor access to your models. The inherent
high availability of platform services like Storage, Key Vault, and Virtual Network ensure
high reliability for your application. Finally, multiple instances of Azure OpenAI ensure
service resilience in case of application-level failures. These architecture components can
help you ensure the reliability of your application at enterprise scale.

Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.

By implementing best practices for application-level and network-level isolation of your
cloud services, this scenario mitigates risks of data exfiltration and data leakage. All
network traffic containing potentially sensitive data that's input to the model is isolated
in a private network. This traffic doesn't traverse public internet routes. You can use
Azure ExpressRoute to further isolate network traffic to the corporate intranet and help
ensure end-to-end network security.

Cost optimization
Cost optimization is about reducing unnecessary expenses and improving operational
efficiencies. For more information, see Overview of the cost optimization pillar.

To help you explore the cost of running this scenario, we've preconfigured all the
services in the Azure pricing calculator. To learn how the pricing would change for your
use case, change the appropriate variables to match your expected traffic.

The following three sample cost profiles provide estimates based on the amount of
traffic. (The estimates assume that a document contains approximately 1,000 tokens.)

Small: For processing 10,000 documents per month.
Medium: For processing 100,000 documents per month.
Large: For processing 10 million documents per month.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal authors:

Ashish Chauhan | Cloud Solution Architect – Data / AI


Jake Wang | Cloud Solution Architect – AI / Machine Learning

Other contributors:

Mick Alberts | Technical Writer

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Azure OpenAI request form
Best practices for prompt engineering with OpenAI API
Azure OpenAI: Documentation, quickstarts, API reference
Azure-Samples/openai-python-enterprise-logging (GitHub)
Configure Azure Cognitive Services virtual networks

Related resources
Protect APIs with Azure Application Gateway and Azure API Management
Query-based document summarization
AI architecture design
Build language model pipelines
with memory
Bing Web Search Azure Cache for Redis Azure Pipelines

Stay ahead of the competition by being informed and having a deep understanding of
your products and competitor products. An AI/machine learning pipeline helps you
quickly and efficiently gather, analyze, and summarize relevant information. This
architecture includes several powerful Azure OpenAI Service models. These models pair
with the popular open-source LangChain framework that's used to develop applications
that are powered by language models.

Note

Some parts in the introduction, components, and workflow of this article were
generated with the help of ChatGPT! Try it for yourself , or try it for your
enterprise.

Architecture

Download a PowerPoint file of this architecture.


Workflow
The batch pipeline stores internal company product information in a fast vector search
database. To achieve this result, the following steps are taken:

1. Internal company documents for products are imported and converted into
searchable vectors. Product-related documents are collected from departments,
such as sales, marketing, and product development. These documents are then
scanned and converted into text by using optical character recognition (OCR)
technology.
2. A LangChain chunking utility chunks the documents into smaller, more
manageable pieces. Chunking breaks down the text into meaningful phrases or
sentences that can be analyzed separately and improves the accuracy of the
pipeline's search capabilities.
3. The language model converts each chunk into a vectorized embedding.
Embeddings are a type of representation that capture the meaning and context of
the text. By converting each chunk into a vectorized embedding, you can store and
search for documents based on their meaning rather than their raw text. To
prevent loss of context within each document chunk, LangChain provides several
utilities for this text splitting step, like capabilities for sliding windows or specifying
text overlap. Some key features include utilities for tagging chunks with document
metadata, optimizing the document retrieval step, and downstream reference.
4. Create an index in a vector store database to store the raw document text,
embeddings vectors, and metadata. The resulting embeddings are stored in a
vector store database along with the raw text of the document and any relevant
metadata, such as the document's title and source.

After the batch pipeline is complete, the real-time, asynchronous pipeline searches for
relevant information. The following steps are taken:

5. Enter a query and relevant metadata, such as your role in the company or the
business unit that you work in. An embeddings model then converts your query
into a vectorized embedding.
6. The orchestrator language model decomposes your query, or main task, into the
set of subtasks that are required to answer your query. Converting the main task
into a series of simpler subtasks allows the language model to address each task
more accurately, which results in better answers with less tendency for inaccuracy.
7. The resulting embedding and decomposed subtasks are stored in the LangChain
model's memory.
a. Top internal document chunks that are relevant to your query are retrieved from
your internal database. A fast vector search is performed for the top n similar
documents that are stored as vectors in Azure Cache for Redis.
b. In parallel, a web search for similar external products is performed via the
LangChain Bing Search language model plugin with a generated search query
that the orchestrator language model composes. Results are stored in the
external model memory component.
8. The vector store database is queried and returns the top relevant product
information pages (chunks and references). The system queries the vector store
database by using your query embedding and returns the most relevant product
information pages, along with the relevant text chunks and references. The
relevant information is stored in LangChain's model memory.
9. The system uses the information that’s stored in LangChain's model memory to
create a new prompt, which is sent to the orchestrator language model to build a
summary report that’s based on your query, company internal knowledge base,
and external web results.
10. Optionally, the output from the previous step is passed to a moderation filter to
remove unwanted information. The final competitive product report is passed to
you.
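
For illustration, the following sketch shows the chunking step with text overlap (step 2) and
the kind of conversation memory that LangChain provides, as described in the Components
section. It assumes a pre-1.0 LangChain release; the chunk size, overlap, and example strings
are arbitrary assumptions.

Python

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.memory import ConversationBufferMemory

# product_document_text stands in for the OCR output of an internal product document.
product_document_text = "<text extracted from an internal product document>"

# Step 2: chunk the document into overlapping pieces (values are illustrative).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(product_document_text)

# Steps 7-9: keep intermediate results so later prompts can reference earlier context.
memory = ConversationBufferMemory(memory_key="chat_history")
memory.save_context(
    {"input": "How does our Model X compare to competitor products?"},
    {"output": f"Retrieved {len(chunks)} internal chunks and external web results."},
)
print(memory.load_memory_variables({}))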

Components
Azure OpenAI Service provides REST API access to OpenAI's powerful language
models, including the GPT-3, GPT-3.5, GPT-4, and embeddings model series. You
can easily adapt these models to your specific task, such as content generation,
summarization, semantic search, converting text to semantically powerful
embeddings vectors, and natural-language-to-code translation.

LangChain is a third-party, open-source framework that you can use to develop
applications that are powered by language models. LangChain makes the
complexities of working and building with AI models easier by providing the
pipeline orchestration framework and helper utilities to run powerful, multiple-
model pipelines.

Memory refers to capturing information. By default, language modeling chains (or
pipelines) and agents operate in a stateless manner. They handle each incoming
query independently, just like the underlying language models and chat models
that they use. But in certain applications, such as chatbots, it's crucial to retain
information from past interactions in the short term and the long term. This area is
where the concept of "memory" comes into play. LangChain provides convenient
utility tools to manage and manipulate past chat messages. These utilities are
designed to be modular regardless of their specific usage. LangChain also offers
seamless methods to integrate these utilities into the memory of chains by using
language models .
Semantic Kernel is an open-source software development kit (SDK) that you can
use to orchestrate and deploy language models. You can explore Semantic Kernel
as a potential alternative to LangChain.

Scenario details
This architecture uses an AI/machine learning pipeline, LangChain, and language models
to create a comprehensive analysis of how your product compares to similar competitor
products. The pipeline consists of two main components: a batch pipeline and a real-
time, asynchronous pipeline. When you send a query to the real-time pipeline, the
orchestrator language model, often GPT-4 or the most powerful available language
model, derives a set of tasks to answer your question. These subtasks invoke other
language models and APIs to mine the internal company product database and the
public internet to build a report that shows the competitive position of your products
versus the competitor products.

Potential use cases


You can apply this solution to the following scenarios:

Compare internal company product information from an internal knowledge base to
competitor products, using information that's retrieved from a Bing web search.
Perform a document search and information retrieval.
Create a chatbot for internal use that has an internal knowledge base and is also
enhanced by an external web search.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Brandon Cowen | Senior Specialized AI Cloud Solution Architect

Other contributor:

Ashish Chauhan | Senior Specialized AI Cloud Solution Architect

To see non-public LinkedIn profiles, sign in to LinkedIn.


Next steps
Azure Business Process Accelerator
Azure OpenAI
Azure OpenAI embeddings QnA
ChatGPT
Enterprise search with OpenAI architecture
Generative AI for developers: Exploring new tools and APIs in Azure OpenAI
Service
LangChain
Memory with language models
Quickstart: Get started generating text using Azure OpenAI Service
Redis on Azure OpenAI
Revolutionize your enterprise data with ChatGPT: Next-gen apps with Azure
OpenAI and Azure Cognitive Search
Semantic Kernel
Vector databases with Azure OpenAI

Related resources
AI architecture design
Batch processing
Types of language API services
Query-based document summarization
Article • 08/31/2023

This guide shows how to perform document summarization by using the Azure OpenAI
GPT-3 model. It describes concepts that are related to the document summarization
process, approaches to the process, and recommendations on which model to use for
specific use cases. Finally, it presents two use cases, together with sample code snippets,
to help you understand key concepts.

Architecture
The following diagram shows how a user query fetches relevant data. The summarizer
uses GPT-3 to generate a summary of the text of the most relevant document. In this
architecture, the GPT-3 endpoint is used to summarize the text.

Download a PowerPoint file of this architecture.

Workflow
This workflow occurs in near-real time.

1. A user sends a query. For example, an employee of a manufacturing company searches
   for specific information about a machine part on the company portal. The
query is first processed by an intent recognizer like conversational language
understanding. The relevant entities or concepts in the user query are used to
select and present a subset of documents from a knowledge base that's populated
offline (in this case, the company's knowledge base database). The output is fed
into a search and analysis engine like Azure Elastic Search , which filters the
relevant documents to return a document set of hundreds instead of thousands or
tens of thousands.
2. The user query is applied again on a search endpoint like Azure Cognitive Search
to rank the retrieved document set in order of relevance (page ranking). The
highest-ranked document is selected.
3. The selected document is scanned for relevant sentences. This scanning process
uses either a coarse method, like extracting all sentences that contain the user
query, or a more sophisticated method, like GPT-3 embeddings, to find
semantically similar material in the document.
4. After the relevant text is extracted, the GPT-3 Completions endpoint with the
summarizer summarizes the extracted content. In this example, the summary of
important details about the part that the employee specified in the query is
returned.

This article focuses on the summarizer component of the architecture.

Scenario details
Enterprises frequently create and maintain a knowledge base about business processes,
customers, products, and information. However, returning relevant content based on a
user query of a large dataset is often challenging. The user can query the knowledge
base and find an applicable document by using methods like page rank, but delving
further into the document to search for relevant information typically becomes a manual
task that takes time. However, with recent advances in foundation transformer models
like the one developed by OpenAI, the query mechanism has been refined by semantic
search methods that use encoding information like embeddings to find relevant
information. These developments enable the ability to summarize content and present it
to the user in a concise and succinct way.

Document summarization is the process of creating summaries from large volumes of data
while maintaining significant informational elements and content value. This article
demonstrates how to use Azure OpenAI Service GPT-3 capabilities for your specific
use case. GPT-3 is a powerful tool that you can use for a range of natural language
processing tasks, including language translation, chatbots, text summarization, and
content creation. The methods and architecture described here are customizable and
can be applied to many datasets.

Potential use cases


Document summarization applies to any organizational domain that requires users to
search large amounts of reference data and generate a summary that concisely
describes relevant information. Typical domains include legal, financial, news, healthcare,
and academic organizations. Potential use cases of summarization are:

Generating summaries to highlight key insights about news, financial reporting, and so on.
Creating a quick reference to support an argument, for example, in legal
proceedings.
Providing context for a paper's thesis, as in academic settings.
Writing literature reviews.
Annotating a bibliography.

Some benefits of using a summarization service for any use case are:

Reduced reading time.


More effective searching of large volumes of disparate data.
Reduced chance of bias from human summarization techniques. (This benefit
depends on how unbiased the training data is.)
Enabling employees and users to focus on more in-depth analysis.

In-context learning
Azure OpenAI Service uses a generative completion model. The model uses natural
language instructions to identify the requested task and the skill required, a process
known as prompt engineering. When you use this approach, the first part of the prompt
includes natural language instructions and/or examples of the desired task. The model
completes the task by predicting the most probable next text. This technique is known
as in-context learning.

With in-context learning, language models can learn tasks from just a few examples. The
language model is provided with a prompt that contains a list of input-output pairs that
demonstrate a task, and then with a test input. The model makes a prediction by
conditioning on the prompt and predicting the next tokens.

There are three main approaches to in-context learning: zero-shot learning, few-shot
learning, and fine-tuning methods that change and improve the output. These
approaches vary based on the amount of task-specific data that's provided to the
model.

Zero-shot: In this approach, no examples are provided to the model. Only the task
request is provided as input. In zero-shot learning, the model depends on previously
trained concepts. It responds based only on data that it's trained on. It doesn't
necessarily understand the semantic meaning, but it has a statistical understanding that's
based on everything that it's learned from the internet about what should be generated
next. The model attempts to relate the given task to existing categories that it has
already learned about and responds accordingly.

Few-shot: In this approach, several examples that demonstrate the expected answer
format and content are included in the call prompt. The model is provided with a very
small training dataset to guide its predictions. Training with a small set of examples
enables the model to generalize and understand unrelated but previously unseen tasks.
Creating few-shot examples can be challenging because you need to accurately
articulate the task that you want the model to perform. One commonly observed
problem is that models are sensitive to the writing style that's used in the training
examples, especially small models.
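
To make the difference concrete, the following sketch builds a hypothetical zero-shot prompt
and a few-shot prompt for the same task. The task and example texts are invented for
illustration only.

Python

task = "Summarize the following support ticket in one sentence."
ticket = "Customer reports that the mobile app crashes when uploading receipts larger than 5 MB."

# Zero-shot: only the task and the input are provided.
zero_shot_prompt = f"{task}\n\nTicket:\n{ticket}\n\nSummary:\n"

# Few-shot: a handful of worked input-output pairs demonstrate the expected format first.
examples = [
    ("Customer cannot reset password; reset email never arrives.",
     "Password-reset emails are not being delivered to the customer."),
    ("Invoice PDF shows the wrong billing address after the latest update.",
     "The latest update causes invoices to display an incorrect billing address."),
]
few_shot_prompt = task + "\n\n"
for example_ticket, example_summary in examples:
    few_shot_prompt += f"Ticket:\n{example_ticket}\nSummary:\n{example_summary}\n\n"
few_shot_prompt += f"Ticket:\n{ticket}\nSummary:\n"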

Fine-tuning: Fine-tuning is a process of tailoring models to your own datasets. In this
customization step, you can improve the process by:

Including a larger set of data (at least 500 examples).


Using traditional optimization techniques with backpropagation to readjust the
weights of the model. These techniques enable higher quality results than the
zero-shot or few-shot approaches provide by themselves.
Improving the few-shot approach by training the model weights with specific
prompts and a specific structure. This technique enables you to achieve better
results on a wider number of tasks without needing to provide examples in the
prompt. The result is less text sent and fewer tokens.

When you create a GPT-3 solution, the main effort is in the design and content of the
training prompt.

Prompt engineering
Prompt engineering is a natural language processing discipline that involves discovering
inputs that yield desirable or useful outputs. When a user prompts the system, the way
the content is expressed can dramatically change the output. Prompt design is the most
significant process for ensuring that the GPT-3 model provides a desirable and
contextual response.

The architecture described in this article uses the completions endpoint for
summarization. The completions endpoint is an Azure Cognitive Services API that
accepts a partial prompt or context as input and returns one or more outputs that
continue or complete the input text. A user provides input text as a prompt, and the
model generates text that attempts to match the context or pattern that's provided.
Prompt design is highly dependent on the task and data. Incorporating prompt
engineering into a fine-tuning dataset and investigating what works best before using
the system in production requires significant time and effort.

Prompt design
GPT-3 models can perform multiple tasks, so you need to be explicit in the goals of the
design. The models estimate the desired output based on the provided prompt.

For example, if you input the words "Give me a list of cat breeds," the model doesn't
automatically assume that you're asking for a list of cat breeds. You could be asking the
model to continue a conversation in which the first words are "Give me a list of cat
breeds" and the next ones are "and I'll tell you which ones I like." If the model just
assumed that you wanted a list of cats, it wouldn't be as good at content creation,
classification, or other tasks.

As described in Learn how to generate or manipulate text, there are three basic
guidelines for creating prompts:

Show and tell. Improve the clarity about what you want by providing instructions,
examples, or a combination of the two. If you want the model to rank a list of
items in alphabetical order or to classify a paragraph by sentiment, show it that
that's what you want.
Provide quality data. If you're building a classifier or want a model to follow a
pattern, be sure to provide enough examples. You should also proofread your
examples. The model can usually recognize spelling mistakes and return a
response, but it might assume misspellings are intentional, which can affect the
response.
Check your settings. The temperature and top_p settings control how
deterministic the model is in generating a response. If you ask it for a response
that has only one right answer, configure these settings at a lower level. If you
want more diverse responses, you might want to configure the settings at a higher
level. A common error is to assume that these settings are "cleverness" or
"creativity" controls.

Alternatives
Azure conversational language understanding is an alternative to the summarizer used
here. The main purpose of conversational language understanding is to build models
that predict the overall intention of an incoming utterance, extract valuable information
from it, and produce a response that aligns with the topic. It's useful in chatbot
applications when it can refer to an existing knowledge base to find the suggestion that
best corresponds to the incoming utterance. It doesn't help much when the input text
doesn't require a response. The intent in this architecture is to generate a short
summary of long textual content. The essence of the content is described in a concise
manner and all important information is represented.

Example scenarios

Use case: Summarizing legal documents


In this use case, a collection of legislative bills passed through Congress is summarized.
The summary is fine-tuned to bring it closer to a human-generated summary, which is
referred to as the ground truth summary.

Zero-shot prompt engineering is used to summarize the bills. The prompt and settings
are then modified to generate different summary outputs.

Dataset

The first dataset is the BillSum dataset for summarization of US Congressional and
California state bills. This example uses only the Congressional bills. The data is split into
18,949 bills to use for training and 3,269 bills to use for testing. BillSum focuses on mid-
length legislation that's between 5,000 and 20,000 characters long. It's cleaned and
preprocessed.

For more information about the dataset and instructions for download, see FiscalNote /
BillSum .

BillSum schema

The schema of the BillSum dataset includes:

bill_id. An identifier for the bill.
text. The bill text.
summary. A human-written summary of the bill.
title. The bill title.
text_len. The character length of the bill.
sum_len. The character length of the bill summary.

In this use case, the text and summary elements are used.
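Here's a minimal sketch of loading the dataset and keeping those two fields. It assumes the Hugging Face datasets package rather than the raw FiscalNote download, along with the field names shown in the schema above:

Python

# pip install datasets
from datasets import load_dataset

# The billsum dataset on the Hugging Face Hub includes US Congressional bills
# (train and test splits) plus a separate California test split.
billsum = load_dataset("billsum")
train, test = billsum["train"], billsum["test"]
print(len(train), len(test))  # roughly the 18,949 / 3,269 split described above

# Only the text and summary fields are used in this use case.
sample = train[0]
bill_text = sample["text"]
ground_truth_summary = sample["summary"]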
Zero-shot
The goal here is to provide the GPT-3 model with conversation-style input. The
completions endpoint of the Azure OpenAI API is called with a prompt that's designed
to generate the best summary of the bill. It's important to create the prompts carefully so
that they extract relevant information. To extract general summaries from a given bill,
the following format is used.

Prefix: What you want it to do.
Context primer: The context.
Context: The information needed to provide a response. In this case, the text to summarize.
Suffix: The intended form of the answer. For example, an answer, a completion, or a summary.

Python

import openai

# normalize_text, bill_title_1, and bill_text_1 come from the accompanying notebook.
API_KEY = "<your-api-key>"  # Set your own API key here
RESOURCE_ENDPOINT = "<your-resource-endpoint>"  # Set a link to your resource endpoint

openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2022-12-01-preview"

TEXT_DAVINCI_001 = "text-davinci-001"  # Deployment name of the text-davinci-001 model in your resource

prompt_i = ('Summarize the legislative bill given the title and the text.\n\nTitle:\n'
            + " ".join([normalize_text(bill_title_1)])
            + '\n\nText:\n' + " ".join([normalize_text(bill_text_1)])
            + '\n\nSummary:\n')

response = openai.Completion.create(
    engine=TEXT_DAVINCI_001,
    prompt=prompt_i,
    temperature=0.4,
    max_tokens=500,
    top_p=1.0,
    frequency_penalty=0.5,
    presence_penalty=0.5,
    stop=['\n\n###\n\n'],  # The ending token used during inference. When the model
                           # reaches this token, it knows the completion is over.
    best_of=1
)

Original text: SAMPLE_BILL_1 .

Ground truth: National Science Education Tax Incentive for Businesses Act of 2007 -
Amends the Internal Revenue Code to allow a general business tax credit for
contributions of property or services to elementary and secondary schools and for
teacher training to promote instruction in science, technology, engineering, or
mathematics.
Zero-shot model summary: The National Science Education Tax Incentive for Businesses
Act of 2007 would create a new tax credit for businesses that make contributions to
science, technology, engineering, and mathematics (STEM) education at the elementary
and secondary school level. The credit would be equal to 100 percent of the qualified
STEM contributions of the taxpayer for the taxable year. Qualified STEM contributions
would include STEM school contributions, STEM teacher externship expenses, and STEM
teacher training expenses.

Observations: The zero-shot model generates a succinct, generalized summary of the
document. It's similar to the human-written ground truth and captures the same key
points. It's organized like a human-written summary and remains focused on the point.

Fine-tuning
Fine-tuning improves upon zero-shot learning by training on more examples than you
can include in the prompt, so you achieve better results on a wider number of tasks.
After a model is fine-tuned, you don't need to provide examples in the prompt. Fine-
tuning saves money by reducing the number of tokens required and enables lower-
latency requests.

At a high level, fine-tuning includes these steps:

Prepare and upload training data.
Train a new fine-tuned model.
Use the fine-tuned model.

For more information, see How to customize a model with Azure OpenAI Service.

Prepare data for fine-tuning

This step enables you to improve upon the zero-shot model by incorporating prompt
engineering into the prompts that are used for fine-tuning. Doing so helps give
directions to the model on how to approach the prompt/completion pairs. In a fine-tuned
model, prompts provide a starting point that the model can learn from and use to make
predictions. This process enables the model to start with a basic understanding of the
data, which can then be improved upon gradually as the model is exposed to more data.
Additionally, prompts can help the model to identify patterns in the data that it might
otherwise miss.

The same prompt engineering structure is also used during inference, after the model is
finished training, so that the model recognizes the behavior that it learned during
training and can generate completions as instructed.
Python

# Adding variables used to design prompts consistently across all examples
# You can learn more here: https://learn.microsoft.com/azure/cognitive-services/openai/how-to/prepare-dataset

LINE_SEP = " \n "
PROMPT_END = " [end] "

# Injecting the zero-shot prompt into the fine-tune dataset
def stage_examples(proc_df):
    proc_df['prompt'] = proc_df.apply(
        lambda x: "Summarize the legislative bill. Do not make up facts.\n\nText:\n"
        + " ".join([normalize_text(x['prompt'])]) + '\n\nSummary:',
        axis=1)
    proc_df['completion'] = proc_df.apply(
        lambda x: " " + normalize_text(x['completion']) + PROMPT_END,
        axis=1)
    return proc_df

df_staged_full_train = stage_examples(df_prompt_completion_train)
df_staged_full_val = stage_examples(df_prompt_completion_val)

Now that the data is staged for fine-tuning in the proper format, you can start running
the fine-tune commands.

Next, you can use the OpenAI CLI to help with some of the data preparation steps. The
OpenAI tool validates data, provides suggestions, and reformats data.

Bash

openai tools fine_tunes.prepare_data -f data/billsum_v4_1/prompt_completion_staged_train.csv

openai tools fine_tunes.prepare_data -f data/billsum_v4_1/prompt_completion_staged_val.csv
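The prepare_data tool writes cleaned JSONL files next to the CSVs. Before you create the fine-tune job in the next step, upload those files so that you have the file IDs that the job payload references. Here's a minimal sketch that assumes the openai package (0.x) configured for Azure OpenAI as shown earlier; the file names are assumptions, so adjust them to the tool's actual output:

Python

import openai

# Paths written by the prepare_data step (assumed names; adjust to your output)
train_path = "data/billsum_v4_1/prompt_completion_staged_train_prepared.jsonl"
val_path = "data/billsum_v4_1/prompt_completion_staged_val_prepared.jsonl"

train_file = openai.File.create(file=open(train_path, "rb"), purpose="fine-tune")
val_file = openai.File.create(file=open(val_path, "rb"), purpose="fine-tune")

# Use these IDs for training_file and validation_file in the payload below.
print("Training file ID:", train_file["id"])
print("Validation file ID:", val_file["id"])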

Fine-tune the dataset

Python

import requests

payload = {
    "model": "curie",
    "training_file": " -- INSERT TRAINING FILE ID -- ",
    "validation_file": "-- INSERT VALIDATION FILE ID --",
    "hyperparams": {
        "n_epochs": 1,
        "batch_size": 200,
        "learning_rate_multiplier": 0.1,
        "prompt_loss_weight": 0.0001
    }
}

url = RESOURCE_ENDPOINT + "openai/fine-tunes?api-version=2022-12-01-preview"
r = requests.post(url,
    headers={
        "api-key": API_KEY,
        "Content-Type": "application/json"
    },
    json = payload
)

data = r.json()
print(data)
fine_tune_id = data['id']
print('Endpoint Called: {endpoint}'.format(endpoint = url))
print('Status Code: {status}'.format(status= r.status_code))
print('Fine tuning job ID: {id}'.format(id=fine_tune_id))
print('Response Information \n\n {text}'.format(text=r.text))

Evaluate the fine-tuned model

This section demonstrates how to evaluate the fine-tuned model.

Python

#Run this cell to check status


url = RESOURCE_ENDPOINT + "openai/fine-tunes/<--insert fine-tune id-->?api-version=2022-12-01-preview"
r = requests.get(url,
headers={
"api-key": API_KEY,
"Content-Type": "application/json"
}
)

data = r.json()
print('Endpoint Called: {endpoint}'.format(endpoint = url))
print('Status Code: {status}'.format(status= r.status_code))
print('Fine tuning ID: {id}'.format(id=fine_tune_id))
print('Status: {status}'.format(status = data['status']))
print('Response Information \n\n {text}'.format(text=r.text))
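After the job reports a succeeded status and the resulting model is deployed, you can call it with the same prompt structure that was used to stage the fine-tuning data. This is a minimal sketch; the deployment name is an assumption, and normalize_text and bill_text_1 come from the earlier snippets:

Python

# Assumed name of the deployment that hosts the fine-tuned model
FINE_TUNED_DEPLOYMENT = "billsum-curie-ft"

prompt_i = ("Summarize the legislative bill. Do not make up facts.\n\nText:\n"
            + " ".join([normalize_text(bill_text_1)]) + "\n\nSummary:")

response = openai.Completion.create(
    engine=FINE_TUNED_DEPLOYMENT,
    prompt=prompt_i,
    temperature=0.4,
    max_tokens=500,
    stop=[" [end] "],  # PROMPT_END marker appended to completions during staging
    best_of=1
)
print(response.choices[0].text)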

Original text: SAMPLE_BILL_1 .

Ground truth: National Science Education Tax Incentive for Businesses Act of 2007 -
Amends the Internal Revenue Code to allow a general business tax credit for
contributions of property or services to elementary and secondary schools and for
teacher training to promote instruction in science, technology, engineering, or
mathematics.
Fine-tuned model summary: This bill provides a tax credit for contributions to
elementary and secondary schools that benefit science, technology, engineering, and
mathematics education. The credit is equal to 100% of qualified STEM contributions
made by taxpayers during the taxable year. Qualified STEM contributions include: (1)
STEM school contributions, (2) STEM teacher externship expenses, and (3) STEM teacher
training expenses. The bill also provides a tax credit for contributions to elementary and
secondary schools that benefit science, technology, engineering, or mathematics
education. The credit is equal to 100% of qualified STEM service contributions made by
taxpayers during the taxable year. Qualified STEM service contributions include: (1)
STEM service contributions paid or incurred during the taxable year for services
provided in the United States or on a military base outside the United States; and (2)
STEM inventory property contributed during the taxable year which is used by an
educational organization located in the United States or on a military base outside the
United States in providing education in grades K-12 in science, technology, engineering
or mathematics.

For the results of summarizing a few more bills by using the zero-shot and fine-tune
approaches, see Results for BillSum Dataset .

Observations: Overall, the fine-tuned model does an excellent job of summarizing the
bill. It captures domain-specific jargon and the key points that are represented but not
explained in the human-written ground truth. It differentiates itself from the zero-shot
model by providing a more detailed and comprehensive summary.

Use case: Financial reports


In this use case, zero-shot prompt engineering is used to create summaries of financial
reports. A summary of summaries approach is then used to generate the results.

Summary of summaries approach

When you write prompts, the prompt and the resulting GPT-3 completion together must
total fewer than 4,000 tokens, so you're limited to a couple of pages of summary
text. For documents that typically contain more than 4,000 tokens (roughly 3,000 words),
you can use a summary of summaries approach. When you use this approach, the entire
text is first divided up to meet the token constraints. Summaries of the shorter texts are
then derived. In the next step, a summary of the summaries is created. This use case
demonstrates the summary of summaries approach with a zero-shot model. This
solution is useful for long documents. Additionally, this section describes how different
prompt engineering practices can vary the results.
Note

Fine-tuning is not applied in the financial use case because there's not enough data
available to complete that step.

Dataset
The dataset for this use case is technical and includes key quantitative metrics to assess
a company's performance.

The financial dataset includes:

url: The URL for the financial report.
pages: The page in the report that contains key information to be summarized (1-indexed).
completion: The ground truth summary of the report.
comments: Any additional information that's needed.

In this use case, Rathbone's financial report, from the dataset, is summarized.
Rathbone's provides individual investment and wealth management services for private
clients. The report highlights Rathbone's performance in 2020 and mentions
performance metrics like profit, funds under management and administration (FUMA),
and income. The key information to summarize is on page 1 of the PDF.

Python

import os
import openai

API_KEY = "<your-api-key>"  # Set your own API key here
RESOURCE_ENDPOINT = "<your-resource-endpoint>"  # Set a link to your resource endpoint

openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2022-12-01-preview"

name = os.path.abspath(os.path.join(os.getcwd(),
    '---INSERT PATH OF LOCALLY DOWNLOADED RATHBONES_2020_PRELIM_RESULTS---')).replace('\\', '/')

pages_to_summarize = [0]  # Page 1 of the PDF (pdfminer page numbers are zero-based)

# Using pdfminer.six to extract the text
# !pip install pdfminer.six
from pdfminer.high_level import extract_text

t = extract_text(name, page_numbers=pages_to_summarize)
print("Text extracted from " + name)
t
Zero-shot approach

When you use the zero-shot approach, you don't provide solved examples. You provide
only the command and the unsolved input. In this example, the Instruct model is used.
This model is specifically intended to take in an instruction and return an answer for it
without extra context, which is ideal for the zero-shot approach.

After you extract the text, you can use various prompts to see how they influence the
quality of the summary:

Python

# Using the text from the Rathbone's report, you can try different prompts to
# see how they affect the summary

prompt_i = ('Summarize the key financial information in the report using qualitative '
            'metrics.\n\nText:\n' + " ".join([normalize_text(t)]) + '\n\nKey metrics:\n')

response = openai.Completion.create(
    engine="davinci-instruct",
    prompt=prompt_i,
    temperature=0,
    max_tokens=2048-int(len(prompt_i.split())*1.5),
    top_p=1.0,
    frequency_penalty=0.5,
    presence_penalty=0.5,
    best_of=1
)
print(response.choices[0].text)
>>>
- Funds under management and administration (FUMA) reached £54.7 billion at
31 December 2020, up 8.5% from £50.4 billion at 31 December 2019
- Operating income totalled £366.1 million, 5.2% ahead of the prior year
(2019: £348.1 million)
- Underlying1 profit before tax totalled £92.5 million, an increase of 4.3%
(2019: £88.7 million); underlying operating margin of 25.3% (2019: 25.5%)

# Different prompt
prompt_i = ('Extract most significant money related values of financial performance '
            'of the business like revenue, profit, etc. from the below text in about '
            'two hundred words.\n\nText:\n' + " ".join([normalize_text(t)]) + '\n\nKey metrics:\n')

response = openai.Completion.create(
    engine="davinci-instruct",
    prompt=prompt_i,
    temperature=0,
    max_tokens=2048-int(len(prompt_i.split())*1.5),
    top_p=1.0,
    frequency_penalty=0.5,
    presence_penalty=0.5,
    best_of=1
)
print(response.choices[0].text)
>>>
- Funds under management and administration (FUMA) grew by 8.5% to reach
£54.7 billion at 31 December 2020
- Underlying profit before tax increased by 4.3% to £92.5 million,
delivering an underlying operating margin of 25.3%
- The board is announcing a final 2020 dividend of 47 pence per share, which
brings the total dividend to 72 pence per share, an increase of 2.9% over
2019

Challenges

As you can see, the model might produce metrics that aren't mentioned in the
original text.

Proposed solution: You can resolve this problem by changing the prompt.

The summary might focus on one section of the article and neglect other
important information.

Proposed solution: You can try a summary of summaries approach. Divide the
report into sections and create smaller summaries that you can then summarize to
create the output summary.

This code implements the proposed solutions:

Python

# Body of function

from pdfminer.high_level import extract_text

text = extract_text(name, page_numbers=pages_to_summarize)

r = splitter(200, text)

tok_l = int(2000/len(r))
tok_l_w = num2words(tok_l)

res_lis = []
# Stage 1: Summaries
for i in range(len(r)):
    prompt_i = (f'Extract and summarize the key financial numbers and percentages '
                f'mentioned in the Text in less than {tok_l_w} '
                f'words.\n\nText:\n' + normalize_text(r[i]) + '\n\nSummary in one paragraph:')
    response = openai.Completion.create(
        engine=TEXT_DAVINCI_001,
        prompt=prompt_i,
        temperature=0,
        max_tokens=tok_l,
        top_p=1.0,
        frequency_penalty=0.5,
        presence_penalty=0.5,
        best_of=1
    )
    t = trim_incomplete(response.choices[0].text)
    res_lis.append(t)

# Stage 2: Summary of summaries
prompt_i = ('Summarize the financial performance of the business like revenue, profit, '
            'etc. in less than one hundred words. Do not make up values that are not '
            'mentioned in the Text.\n\nText:\n'
            + " ".join([normalize_text(res) for res in res_lis]) + '\n\nSummary:\n')

response = openai.Completion.create(
    engine=TEXT_DAVINCI_001,
    prompt=prompt_i,
    temperature=0,
    max_tokens=200,
    top_p=1.0,
    frequency_penalty=0.5,
    presence_penalty=0.5,
    best_of=1
)

print(trim_incomplete(response.choices[0].text))
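The preceding snippet relies on helper functions that aren't defined in this article: splitter, trim_incomplete, and normalize_text come from the accompanying notebook, and num2words is the PyPI package of the same name. Here's a minimal sketch of what splitter and trim_incomplete might look like; treat it as an illustration under those assumptions, not the notebook's exact implementation:

Python

# pip install num2words
from num2words import num2words  # for example, num2words(250) -> "two hundred and fifty"

def splitter(n_words, text):
    """Split text into chunks of roughly n_words words each."""
    words = text.split()
    return [" ".join(words[i:i + n_words]) for i in range(0, len(words), n_words)]

def trim_incomplete(text):
    """Drop a trailing partial sentence so that the summary ends cleanly."""
    text = text.strip()
    if text and text[-1] not in ".!?":
        cut = max(text.rfind("."), text.rfind("!"), text.rfind("?"))
        if cut != -1:
            text = text[:cut + 1]
    return text.strip()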

The input prompt includes the original text from Rathbone's financial report for a
specific year.

Ground truth: Rathbones has reported revenue of £366.1m in 2020, up from £348.1m in
2019, and an increase in underlying profit before tax to £92.5m from £88.7m. Assets
under management rose 8.5% from £50.4bn to £54.7bn, with assets in wealth
management increasing 4.4% to £44.9bn. Net inflows were £2.1bn in 2020 compared
with £600m in the previous year, driven primarily by £1.5bn inflows into its funds
business and £400m due to the transfer of assets from Barclays Wealth.

Zero-shot summary of summaries output: Rathbones delivered a strong performance
in 2020, with funds under management and administration (FUMA) growing by 8.5% to
reach £54.7 billion at the end of the year. Underlying profit before tax increased by 4.3%
to £92.5 million, delivering an underlying operating margin of 25.3%. Total net inflows
across the group were £2.1 billion, representing a growth rate of 4.2%. Profit before tax
for the year was £43.8 million, with basic earnings per share totalling 49.6p. Operating
income for the year was 5.2% ahead of the prior year, totalling £366.1 million.

Observations: The summary of summaries approach generates a great result set that
resolves the challenges encountered initially when a more detailed and comprehensive
summary was provided. It does a great job of capturing the domain-specific jargon and
the key points, which are represented in the ground truth but not explained well.

The zero-shot model works well for summarizing mainstream documents. If the data is
industry-specific or topic-specific, contains industry-specific jargon, or requires industry-
specific knowledge, fine-tuning performs best. For example, this approach works well for
medical journals, legal forms, and financial statements. You can use the few-shot
approach instead of zero-shot to provide the model with examples of how to formulate
a summary, so it can learn to mimic the summary provided. For the zero-shot approach,
this solution doesn't retrain the model. The model's knowledge is based on the GPT-3
training. GPT-3 is trained with almost all available data from the internet. It performs
well for tasks that don't require specific knowledge.

For the results of using the zero-shot summary of summaries approach on a few reports
in the financial dataset, see Results for Summary of Summaries .

Recommendations
There are many ways to approach summarization by using GPT-3, including zero-shot,
few-shot, and fine-tuning. The approaches produce summaries of varying quality. You
can explore which approach produces the best results for your intended use case.

Based on observations from the testing presented in this article, here are a few
recommendations:

Zero-shot is best for mainstream documents that don't require specific domain
knowledge. This approach attempts to capture all high-level information in a
succinct, human-like manner and provides a high-quality baseline summary. Zero-
shot creates a high-quality summary for the legal dataset that's used in the tests in
this article.
Few-shot is difficult to use for summarizing long documents because the token
limitation is exceeded when an example text is provided. You can instead use a
zero-shot summary of summaries approach for long documents or increase the
dataset to enable successful fine-tuning. The summary of summaries approach
generates excellent results for the financial dataset that's used in these tests.
Fine-tuning is most useful for technical or domain-specific use cases when the
information isn't readily available. To achieve the best results with this approach,
you need a dataset that contains a couple thousand samples. Fine-tuning captures
the summary in a few templated ways, trying to conform to how the dataset
presents the summaries. For the legal dataset, this approach generates a higher
quality of summary than the one created by the zero-shot approach.

Evaluating summarization
There are multiple techniques for evaluating the performance of summarization models.

Here are a few:

ROUGE (Recall-Oriented Understudy for Gisting Evaluation). This technique includes
measures for automatically determining the quality of a summary by comparing it to
ideal summaries created by humans. The measures count the number of overlapping
units, like n-gram, word sequences, and word pairs, between the computer-generated
summary being evaluated and the ideal summaries.

Here's an example:

Python

reference_summary = "The cat ison porch by the tree"


generated_summary = "The cat is by the tree on the porch"
rouge = Rouge()
rouge.get_scores(generated_summary, reference_summary)
[{'rouge-1': {'r':1.0, 'p': 1.0, 'f': 0.999999995},
'rouge-2': {'r': 0.5714285714285714, 'p': 0.5, 'f': 0.5333333283555556},
'rouge-1': {'r': 0.75, 'p': 0.75, 'f': 0.749999995}}]

BERTScore. This technique computes similarity scores by aligning generated and
reference summaries on a token level. Token alignments are computed greedily to
maximize the cosine similarity between contextualized token embeddings from BERT.

Here's an example:

Python

import torchmetrics
from torchmetrics.text.bert import BERTScore
preds = "You should have ice cream in the summer"
target = "Ice creams are great when the weather is hot"
bertscore = BERTScore()
score = bertscore(preds, target)
print(score)
Similarity matrix. A similarity matrix is a representation of the similarities between
different entities in a summarization evaluation. You can use it to compare different
summaries of the same text and measure their similarity. It's represented by a two-
dimensional grid, where each cell contains a measure of the similarity between two
summaries. You can measure the similarity by using various methods, like cosine
similarity, Jaccard similarity, and edit distance. You then use the matrix to compare the
summaries and determine which one is the most accurate representation of the original
text.
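As a minimal sketch of the idea, independent of BERTScore, you could build a similarity matrix over a handful of candidate summaries by using TF-IDF vectors and cosine similarity; scikit-learn is assumed here:

Python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

summaries = [
    "The cat is on the porch by the tree",   # reference
    "The cat is by the tree on the porch",   # candidate A
    "A dog sleeps in the garden",            # candidate B
]

vectors = TfidfVectorizer().fit_transform(summaries)
matrix = cosine_similarity(vectors)  # matrix[i][j] = similarity between summary i and summary j
print(matrix.round(2))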

Here's a sample command that gets the similarity matrix of a BERTScore comparison of
two similar sentences:

Bash

bert-score-show --lang en \
  -r "The cat is on the porch by the tree" \
  -c "The cat is by the tree on the porch" \
  -f out.png

The first sentence, "The cat is on the porch by the tree", is referred to as the candidate.
The second sentence is referred to as the reference. The command uses BERTScore to
compare the sentences and generate a matrix.

The following matrix displays the output that's generated by the preceding command:

For more information, see SummEval: Reevaluating Summarization Evaluation. For a
PyPI toolkit for summarization, see summ-eval 0.892.

Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Noa Ben-Efraim | Data & Applied Scientist

Other contributors:

Mick Alberts | Technical Writer


Rania Bayoumy | Senior Technical Program Manager
Harsha Viswanath | Principal Applied Science Manager
To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
Azure OpenAI - Documentation, quickstarts, API reference
What are intents in LUIS?
Conversational language understanding
Jupyter Notebook with technical details and execution of this use case

Related resources
AI architecture design
Choose a Microsoft cognitive services technology
Natural language processing technology
Conversation summarization
Azure AI services

Most businesses provide customer service support to help customers with product
queries, troubleshooting, and maintaining or upgrading features or the product itself. To
provide a satisfactory resolution, customer support specialists need to respond quickly
with accurate information. OpenAI can help organizations with customer support in a
variety of ways.

This guide describes how to generate summaries of customer-agent interactions by
using the Azure OpenAI GPT-3 model. It contains an end-to-end sample architecture
that illustrates the key components involved in getting a summary of a text input. The
generation of the text input is outside the scope of this guide. The focus of this guide is
to describe the process of implementing the summarization of a set of sample agent-
customer conversations and analyze the outcomes of various approaches to
summarization.

Conversation scenarios
Self-service chatbots (fully automated). In this scenario, customers can interact
with a chatbot that's powered by GPT-3 and trained on industry-specific data. The
chatbot can understand customer questions and answer appropriately based on
responses learned from a knowledge base.
Chatbot with agent intervention (semi-automated). Questions posed by
customers are sometimes complex and necessitate human intervention. In such
cases, GPT-3 can provide a summary of the customer-chatbot conversation and
help the agent with quick searches for additional information from a large
knowledge base.
Summarizing transcripts (fully automated or semi-automated). In most customer
support centers, agents are required to summarize conversations for record
keeping, future follow-up, training, and other internal processes. GPT-3 can
provide automated or semi-automated summaries that capture salient details of
conversations for further use.

This guide focuses on the process for summarizing transcripts by using Azure OpenAI
GPT-3.

On average, it takes an agent 5 to 6 minutes to summarize a single agent-customer
conversation. Given the high volumes of requests service teams handle on any given
day, this additional task can overburden the team. OpenAI is a good way to help agents
with summarization-related activities. It can improve the efficiency of the customer
support process and provide better precision. Conversation summarization can be
applied to any customer support task that involves agent-customer interaction.

Conversation summarization service


Conversation summarization is suitable in scenarios where customer support
conversations follow a question-and-answer format.

Some benefits of using a summarization service are:

Increased efficiency: It allows customer service agents to quickly summarize
customer conversations, eliminating the need for long back-and-forth exchanges.
This efficiency helps to speed up the resolution of customer problems.
Improved customer service: Agents can use summaries of conversations in future
interactions to quickly find the information needed to accurately resolve customer
concerns.
Improved knowledge sharing: Conversation summarization can help customer
service teams share knowledge with each other quickly and effectively. It equips
customer service teams with better resolutions and helps them provide faster
support.

Architecture
A typical architecture for a conversation summarizer has three main stages: pre-
processing, summarization, and post-processing. If the input contains a verbal
conversation or any form of speech, the speech needs to be transcribed to text. For
more information, see Azure Speech-to-text service .

Here's a sample architecture:


Download a PowerPoint file of this architecture.

Workflow
1. Gather input data: Feed relevant input data into the pipeline. If the source is an
audio file, you need to convert it to text by using a speech-to-text service like Azure
speech to text.
2. Pre-process the data: Remove confidential information and any unimportant
conversation from the data.
3. Feed the data into the summarizer: Pass the data in a prompt via Azure OpenAI
APIs. In-context learning models include zero-shot, few-shot, or a custom model.
4. Generate a summary: The model generates a summary of the conversation.
5. Post-process the data: Apply a profanity filter and various validation checks to the
summary. Add sensitive or confidential data that was removed during the pre-
process step back into the summary.
6. Evaluate the results: Review and evaluate the results. This step can help you
identify areas where the model needs to be improved and find errors.
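Here's a minimal sketch of how those stages could be wired together. The function bodies are placeholders for the pre-processing, Azure OpenAI call, and post-processing described in the following sections, and the deployment name is an assumption:

Python

import openai

def preprocess(transcript):
    # Remove PII, casual chit-chat, and other irrelevant exchanges (see "Pre-process").
    return transcript

def summarize(transcript, deployment="davinci-instruct"):
    # Zero-shot prompt; few-shot examples or a fine-tuned deployment can be swapped in.
    prompt = ("Please provide a summary of the conversation below:\n\n"
              + transcript + "\n\nThe summary is as follows:")
    response = openai.Completion.create(
        engine=deployment, prompt=prompt, temperature=0.0, max_tokens=400
    )
    return response.choices[0].text.strip()

def postprocess(summary):
    # Apply profanity filtering and validation checks, and re-insert redacted details (see "Post-process").
    return summary

def summarization_pipeline(transcript):
    return postprocess(summarize(preprocess(transcript)))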

The following sections provide more details about the three main stages.

Pre-process
The goal of pre-processing is to ensure that the data provided to the summarizer service
is relevant and doesn't include sensitive or confidential information.
Here are some pre-processing steps that can help condition your raw data. You might
need to apply one or many steps, depending on the use case.

Remove personally identifiable information (PII). You can use the Conversational
PII API (preview) to remove PII from transcribed or written text. This example shows
the output after the API has removed PII:

Document text: Parker Doe has repaid all of their loans as of
2020-04-25. Their SSN is 999-99-9999. To contact them, use
their phone number 555-555-0100. They are originally from
Brazil and have Brazilian CPF number 998.214.865-68

Redacted document text: ******* has repaid all of their
loans as of *******. Their SSN is *******. To contact
them, use their phone number *******. They are originally from
Brazil and have Brazilian CPF number 998.214.865-68

...Entity 'Parker Doe' with category 'Person' got redacted
...Entity '2020-04-25' with category 'DateTime' got redacted
...Entity '999-99-9999' with category 'USSocialSecurityNumber' got redacted
...Entity '555-555-0100' with category 'PhoneNumber' got redacted

Remove extraneous information. Customer agents start conversations with casual
exchanges that don't include relevant information. A trigger can be added to a
conversation to identify the point where the concern or relevant question is first
addressed. Removing that exchange from the context can improve the accuracy of
the summarizer service because the model is then fine-tuned on the most relevant
information in the conversation. The Curie GPT-3 engine is a popular choice for
this task because it's trained extensively, via content from the internet, to identify
this type of casual conversation.

Remove excessively negative conversations. Conversations can also include
negative sentiments from unhappy customers. You can use Azure content-filtering
methods like Azure Content Moderator to remove conversations that contain
sensitive information from analysis. Alternatively, OpenAI offers a moderation
endpoint, a tool that you can use to check whether content complies with
OpenAI's content policies.
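Here's a minimal sketch of two of these pre-processing steps: PII redaction and filtering of very negative utterances. It assumes the azure-ai-textanalytics package and uses the general Language service PII and sentiment APIs rather than the conversational PII preview API mentioned above; the endpoint, key, and threshold are placeholders:

Python

# pip install azure-ai-textanalytics
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

client = TextAnalyticsClient(
    endpoint="https://<your-language-resource>.cognitiveservices.azure.com/",  # placeholder
    credential=AzureKeyCredential("<your-language-key>"),                      # placeholder
)

def redact_pii(utterances):
    """Return the utterances with detected PII replaced by asterisks."""
    results = client.recognize_pii_entities(utterances)
    return [doc.redacted_text for doc in results if not doc.is_error]

def drop_very_negative(utterances, threshold=0.9):
    """Drop utterances whose negative-sentiment confidence exceeds the threshold."""
    results = client.analyze_sentiment(utterances)
    return [
        text
        for text, doc in zip(utterances, results)
        if doc.is_error or doc.confidence_scores.negative < threshold
    ]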

Summarizer
OpenAI's text-completion API endpoint is called the completions endpoint. To start the
text-completion process, it requires a prompt. Prompt engineering is a process used in
large language models. The first part of the prompt includes natural language
instructions and/or examples of the specific task requested (in this scenario,
summarization). Prompts allow developers to provide some context to the API, which
can help it generate more relevant and accurate text completions. The model then
completes the task by predicting the most probable next text. This technique is known
as in-context learning.

Note

Extractive summarization attempts to identify and extract salient information from a
text and group it to produce a concise summary without understanding the
meaning or context.

Abstractive summarization rewrites a text by first creating an internal semantic
representation and then creating a summary by using natural language processing.
This process involves paraphrasing.

There are three main approaches for training models for in-context learning: zero-shot,
few-shot, and fine-tuning. These approaches vary based on the amount of task-specific
data that's provided to the model.

Zero-shot: In this approach, no examples are provided to the model. The task
request is the only input. In zero-shot learning, the model relies on data that GPT-3
is already trained on (almost all available data from the internet). It attempts to
relate the given task to existing categories that it has already learned about and
responds accordingly.

Few-shot: When you use this approach, you include a small number of examples in
the prompt that demonstrate the expected answer format and the context. The
model is provided with a very small amount of training data, typically just a few
examples, to guide its predictions. Training with a small set of examples enables
the model to generalize and understand related but previously unseen tasks.
Creating these few-shot examples can be challenging because they need to clarify
the task you want the model to perform. One commonly observed problem is that
models, especially small ones, are sensitive to the writing style that's used in the
training examples.

The main advantages of this approach are a significant reduction in the need for
task-specific data and reduced potential to learn an excessively narrow distribution
from a large but narrow fine-tuning dataset.

With this approach, you can't update the weights of the pretrained model.

For more information, see Language Models are Few-Shot Learners.


Fine-tuning: Fine-tuning is the process of tailoring models to get a specific desired
outcome from your own datasets. It involves retraining models on new data. For
more information, see Learn how to customize a model for your application.

You can use this customization step to improve your process by:
Including a larger set of example data.
Using traditional optimization techniques with backpropagation to readjust the
weights of the model. These techniques enable higher quality results than the
zero-shot or few-shot approaches provide by themselves.
Improving the few-shot learning approach by training the model weights with
specific prompts and a specific structure. This technique enables you to achieve
better results on a wider number of tasks without needing to provide examples
in the prompt. The result is less text sent and fewer tokens.

Disadvantages include the need for a large new dataset for every task, the
potential for poor generalization out of distribution, and the possibility to exploit
spurious features of the training data, resulting in high chances of unfair
comparison with human performance.

Creating a dataset for model customization is different from designing prompts for
use with the other models. Prompts for completion calls often use either detailed
instructions or few-shot learning techniques and consist of multiple examples. For
fine-tuning, we recommend that each training example consists of a single input
example and its desired output. You don't need to provide detailed instructions or
examples in the prompt.

As you increase the number of training examples, your results improve. We
recommend including at least 500 examples. It's typical to use between thousands
and hundreds of thousands of labeled examples. Testing indicates that each
doubling of the dataset size leads to a linear increase in model quality.

This guide demonstrates the curie-instruct/text-curie-001 and davinci-instruct/text-
davinci-001 engines. These engines are frequently updated. The version you use might
be different.

Post-process
We recommend that you check the validity of the results that you get from GPT-3.
Implement validity checks by using a programmatic approach or classifiers, depending
on the use case. Here are some critical checks:

Verify that no significant points are missed.


Check for factual inaccuracies.
Check for any bias introduced by the training data used on the model.
Verify that the model doesn't change text by adding new ideas or points. This
problem is known as hallucination.
Check for grammatical and spelling errors.
Use a content profanity filter like Content Moderator to ensure that no
inappropriate or irrelevant content is included.

Finally, reintroduce any vital information that was previously removed from the
summary, like confidential information.

In some cases, a summary of the conversation is also sent to the customer, along with
the original transcript. In these cases, post-processing involves appending the transcript
to the summary. It can also include adding lead-in sentences like "Please see the
summary below."

Considerations
It's important to fine-tune your base models with an industry-specific training dataset
and change the size of available datasets. Fine-tuned models perform best when the
training data includes at least 1,000 data points and the ground truth (human-generated
summaries) used to train the models is of high quality.

The tradeoff is cost. The process of labeling and cleaning datasets can be expensive. To
ensure high-quality training data, you might need to manually inspect ground truth
summaries and rewrite low-quality summaries. Consider the following points about the
summarization stage:

Prompt engineering: When provided with little instruction, Davinci often performs
better than other models. To optimize results, experiment with different prompts
for different models.
Token size: A summarizer that's based on GPT-3 is limited to a total of 4,098
tokens, including the prompt and completion. To summarize larger passages,
separate the text into parts that conform to these constraints. Summarize each part
individually and then collect the results in a final summary.
Garbage in, garbage out: Trained models are only as good as the training data that
you provide. Be sure that the ground truth summaries in the training data are well
suited to the information that you eventually want to summarize in your dialogs.
Stopping point: The model stops summarizing when it reaches a natural stopping
point or a stop sequence that you provide. Test this parameter to choose among
multiple summaries and to check whether summaries look incomplete.
Example scenario: Summarizing transcripts in call centers
This scenario demonstrates how the Azure OpenAI summarization feature can help
customer service agents with summarization tasks. It tests the zero-shot, few-shot, and
fine-tuning approaches and compares the results against human-generated summaries.

The dataset used in this scenario is a set of hypothetical conversations between
customers and agents in the Xbox customer support center about various Xbox
products and services. The hypothetical chat is labeled with Prompt. The human-written
abstractive summary is labeled with Completion.

Prompt

Customer: Question on XAIL

Agent: Hello! How can I help you today?

Customer: Hi, I have a question about the Accessibility insider ring

Agent: Okay. I can certainly assist you with that.

Customer: Do I need to sign up for the preview ring to join the accessibility league?

Agent: No. You can leave your console out of Xbox Preview rings and still join the
League. However, note that some experiences made available to you may require that
you join an Xbox Preview ring.

Customer: Okay. And I can just sign up for preview ring later yeah?

Agent: That is correct.

Customer: Sweet.

Completion

Customer wants to know if they need to sign up for preview rings to join Xbox
Accessibility Insider League. Agent responds that it is not mandatory, but that some
experiences may require it.

Ideal output. The goal is to create summaries that follow this format: "Customer said x.
Agent responded y." Another goal is to capture salient features of the dialog, like the
customer complaint, suggested resolution, and follow-up actions.
Here's an example of a customer support interaction, followed by a comprehensive
human-written summary of it:

Dialog

Customer: Hello. I have a question about the game pass.

Agent: Hello. How are you doing today?

Customer: I'm good.

Agent. I see that you need help with the Xbox Game Pass.

Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.

Agent: Once a game leaves the Xbox Game Pass catalog, you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off (or the best available
discounted price) to continue playing a game once it leaves the catalog.

Customer: Got it, thanks

Ground truth summary

Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.

Zero-shot
The zero-shot approach is useful when you don't have ample labeled training data. In
this case, there aren't enough ground truth summaries. It's important to design prompts
carefully to extract relevant information. The following format is used to extract general
summaries from customer-agent chats:

prefix = "Please provide a summary of the conversation below: "

suffix = "The summary is as follows: "

Here's a sample that shows how to run a zero-shot model:

Python
rouge = Rouge()

# Run zero-shot prediction for all engines of interest
deploymentNames = ["curie-instruct", "davinci-instruct"]  # also known as text-davinci/text-instruct

for deployment in deploymentNames:
    url = openai.api_base + "openai/deployments/" + deployment + "/completions?api-version=2022-12-01-preview"
    response_list = []
    rouge_list = []
    print("calling..." + deployment)
    for i in range(len(test)):
        response_i = openai.Completion.create(
            engine = deployment,
            prompt = build_prompt(prefix, [test['prompt'][i]], suffix),
            temperature = 0.0,
            max_tokens = 400,
            top_p = 1.0,
            frequency_penalty = 0.5,
            presence_penalty = 0.0,
            stop = ["end"]  # We recommend that you adjust the stop sequence based on the dataset
        )
        scores = rouge.get_scores(normalize_text(response_i['choices'][0]['text']), test['completion'][i])
        rouge_list += [scores[0]['rouge-1']['f']]
        response_list += [response_i]
    summary_list = [normalize_text(i['choices'][0]['text']) for i in response_list]
    test[deployment + "_zeroshotsummary"] = summary_list
    test[deployment + "_zeroshotroguescore"] = rouge_list

Results and observations


The zero-shot model's output is produced directly from the base model. In this case,
both Curie and Davinci summarize the dialog fairly well. The only noticeable difference is
that the Curie model provides a little less detail. Curie starts the summary with
"customer asks the agent about the Xbox game pass." The corresponding sentence in
Davinci is "customer asked how long they could access games after they leave the Xbox
game pass catalog."

Dialog

Customer: Hello. I have a question about the game pass.

Agent: Hello. How are you doing?

Customer: I'm good.

Agent: I see that you need help with the Xbox Game Pass.
Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.

Agent: Once a game leaves the Xbox game pass catalog you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off or the best available
discounted price to continue playing a game once it leaves the catalog.

Ground truth

Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.

Davinci result

The customer asked how long they could access games after they leave the Xbox game
pass catalog. The agent told them that they would need to purchase a digital copy of
the game from the Xbox app for windows or the Microsoft store in order to continue
playing it. The agent also reminded the customer that they would be notified prior to a
game leaving the Xbox game pass catalog. as a member the customer could purchase
any game in the catalog for up to 20 off or the best available discounted price.

Curie result

The customer asks the agent about the Xbox game pass. the agent tells the customer
that once a game leaves the Xbox game pass catalog the customer will need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game. The agent
also reminds the customer that Xbox will notify members prior to a game leaving the
Xbox game pass catalog.

Few-shot
When you use the few-shot approach, the model is provided with a small number of
examples.

context_primer = "Below are examples of conversations and their corresponding

summaries:"

prefix = "Please provide a summary of the conversation below: "


suffix = "The summary is as follows: "

Here's a sample that shows how to run a few-shot model:

Python

# Select a small number of training examples to use as few-shot context.
# (The exact slice was elided in the source; adjust as needed.)
train_small = train[:2]
train_small_json = train_small.to_dict(orient='records')
compiled_train_prompt = build_prompt_fewshot(prefix, context_primer, train_small_json, suffix)

for deployment in deploymentNames:
    url = openai.api_base + "openai/deployments/" + deployment + "/completions?api-version=2022-12-01-preview"
    response_list = []
    rouge_list = []
    print("calling..." + deployment)
    for i in range(len(test)):
        response_i = openai.Completion.create(
            engine = deployment,
            prompt = compiled_train_prompt + build_prompt(prefix, [test['prompt'][i]], suffix),
            temperature = 0.0,
            max_tokens = 400,
            top_p = 1.0,
            frequency_penalty = 0.5,
            presence_penalty = 0.0,
            stop = ["end"]  # We recommend that you adjust the stop sequence based on the dataset
        )
        scores = rouge.get_scores(normalize_text(response_i['choices'][0]['text']), test['completion'][i])
        rouge_list += [scores[0]['rouge-1']['f']]
        response_list += [response_i]
    summary_list = [normalize_text(i['choices'][0]['text']) for i in response_list]
    test[deployment + "_fewshot"] = summary_list
    test[deployment + "_FSscore1"] = rouge_list

Results and observations


With the few-shot approach, the summaries continue to capture salient features of the
conversation. The Davinci summary is more compact and closer to the ground truth.
Curie fabricates some trivial details.

Dialog

Customer: Hello. I have a question about the game pass.

Agent: Hello. How are you doing?


Customer: I'm good.

Agent: I see that you need help with the Xbox Game Pass.

Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.

Agent: Once a game leaves the Xbox Game Pass catalog, you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off or the best available
discounted price to continue playing a game once it leaves the catalog.

Ground truth

Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.

Davinci result

customer wanted to know how long they could access games after they leave game
pass. Agent informs that once a game leaves the Xbox game pass catalog the customer
would need to purchase a digital copy or obtain another form of entitlement to
continue playing the game.

Curie result

customer has a question about the game pass. customer is good. agent needs help with
the Xbox game pass. customer asks how long they can access the games after they
leave the game pass catalog. Agent informs that once a game leaves the Xbox game
pass catalog the customer will need to purchase a digital copy from the Xbox app for
windows or the Microsoft store play from a disc or obtain another form of entitlement
to continue playing the game. customer is happy to hear this and thanks agent.

Fine-tuning
Fine-tuning is the process of tailoring models to get a specific desired outcome from
your own datasets.

Here's an example format:


{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

Results and observations


Testing suggests that a fine-tuned Curie model leads to results that are comparable to
those of a Davinci few-shot model. Both summaries capture the customer's question
and the agent's answer without capturing the details about discounts and without
adding content. Both summaries are similar to the ground truth.

Dialog

Customer: Hello. I have a question about the game pass.

Agent: Hello. How are you doing?

Customer: I'm good.

Agent: I see that you need help with the Xbox Game Pass.

Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.

Agent: Once a game leaves the Xbox Game Pass catalog, you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off or the best available
discounted price to continue playing a game once it leaves the catalog.

Ground truth

Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.

Curie result

customer wants to know how long they can access the games after they leave game
pass. agent explains that once a game leaves the Xbox game pass catalog they'll need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game.
Conclusions
Generally, the Davinci model requires fewer instructions to perform tasks than other
models, such as Curie. Davinci is better suited for summarizing text that requires an
understanding of context or specific language. Because Davinci is the most complex
model, its latency is higher than that of other models. Curie is faster than Davinci and is
capable of summarizing conversations.

These tests suggest that you can generate better summaries when you provide more
instruction to the model via few-shot or fine-tuning. Fine-tuned models are better at
conforming to the structure and context learned from a training dataset. This capability
is especially useful when summaries are domain specific (for example, generating
summaries from a doctor's notes or online-prescription customer support). If you use
fine-tuning, you have more control over the types of summaries that you see.

For the sake of easy comparison, here's a summary of the results that are presented
earlier:

Ground truth

Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.

Davinci zero-shot result

The customer asked how long they could access games after they leave the Xbox game
pass catalog. The agent told them that they would need to purchase a digital copy of
the game from the Xbox app for windows or the Microsoft store in order to continue
playing it. The agent also reminded the customer that they would be notified prior to a
game leaving the Xbox game pass catalog. As a member the customer could purchase
any game in the catalog for up to 20 off or the best available discounted price.

Curie zero-shot result

The customer asks the agent about the Xbox game pass. the agent tells the customer
that once a game leaves the Xbox game pass catalog the customer will need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game. The agent
also reminds the customer that Xbox will notify members prior to a game leaving the
Xbox game pass catalog.

Davinci few-shot result


customer wanted to know how long they could access games after they leave game
pass. Agent informs that once a game leaves the Xbox game pass catalog the customer
would need to purchase a digital copy or obtain another form of entitlement to
continue playing the game.

Curie few-shot result

customer has a question about the game pass. customer is good. agent needs help with
the Xbox game pass. customer asks how long they can access the games after they
leave the game pass catalog. Agent informs that once a game leaves the Xbox game
pass catalog the customer will need to purchase a digital copy from the Xbox app for
windows or the Microsoft store play from a disc or obtain another form of entitlement
to continue playing the game. customer is happy to hear this and thanks agent.

Curie fine-tuning result

customer wants to know how long they can access the games after they leave game
pass. agent explains that once a game leaves the Xbox game pass catalog they'll need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game.

Evaluating summarization
There are multiple techniques for evaluating the performance of summarization models.

Here are a few:

ROUGE (Recall-Oriented Understudy for Gisting Evaluation). This technique includes
measures for automatically determining the quality of a summary by comparing it to
ideal summaries created by humans. The measures count the number of overlapping
units, like n-gram, word sequences, and word pairs, between the computer-generated
summary that's being evaluated and the ideal summaries.

Here's an example:

Python

reference_summary = "The cat ison porch by the tree"


generated_summary = "The cat is by the tree on the porch"
rouge = Rouge()
rouge.get_scores(generated_summary, reference_summary)
[{'rouge-1': {'r':1.0, 'p': 1.0, 'f': 0.999999995},
'rouge-2': {'r': 0.5714285714285714, 'p': 0.5, 'f': 0.5333333283555556},
'rouge-1': {'r': 0.75, 'p': 0.75, 'f': 0.749999995}}]
BERTScore. This technique computes similarity scores by aligning generated and
reference summaries on a token level. Token alignments are computed greedily to
maximize the cosine similarity between contextualized token embeddings from BERT.

Here's an example:

Python

import torchmetrics
from torchmetrics.text.bert import BERTScore
preds = "You should have ice cream in the summer"
target = "Ice creams are great when the weather is hot"
bertscore = BERTScore()
score = bertscore(preds, target)
print(score)

Similarity matrix. A similarity matrix is a representation of the similarities between
different entities in summarization evaluation. You can use it to compare different
summaries of the same text and measure their similarity. It's represented by a two-
dimensional grid, where each cell contains a measure of the similarity between two
summaries. You can measure the similarity by using a variety of methods, like cosine
similarity, Jaccard similarity, and edit distance. You then use the matrix to compare the
summaries and determine which one is the most accurate representation of the original
text.

Here's a sample command that generates the similarity matrix of a BERTScore
comparison of two similar sentences:

Bash

bert-score-show --lang en \
  -r "The cat is on the porch by the tree" \
  -c "The cat is by the tree on the porch" \
  -f out.png

The first sentence, "The cat is on the porch by the tree," is referred to as the candidate.
The second sentence is referred as the reference. The command uses BERTScore to
compare the sentences and generate a matrix.

This following matrix displays the output that's generated by the preceding command:

For more information, see SummEval: Reevaluating Summarization Evaluation. For a
PyPI toolkit for summarization, see summ-eval 0.892.

Responsible use
GPT can produce excellent results, but you need to check the output for social, ethical,
and legal biases and harmful results. When you fine-tune models, you need to remove
any data points that might be harmful for the model to learn. You can use red teaming
to identify any harmful outputs from the model. You can implement this process
manually and support it by using semi-automated methods. You can generate test cases
by using language models and then use a classifier to detect harmful behavior in the
test cases. Finally, you should perform a manual check of generated summaries to
ensure that they're ready to be used.

For more information, see Red Teaming Language Models with Language Models .
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.

Principal author:

Meghna Jani | Data & Applied Scientist II

Other contributor:

Mick Alberts | Technical Writer

To see non-public LinkedIn profiles, sign in to LinkedIn.

Next steps
More information about Azure OpenAI
ROUGE reference article
Training module: Introduction to Azure OpenAI Service
Learning path: Develop AI solutions with Azure OpenAI

Related resources
Query-based document summarization
Choose a Microsoft cognitive services technology
Natural language processing technology
