Escolar Documentos
Profissional Documentos
Cultura Documentos
AI concepts
Algorithm
An algorithm is a sequence of calculations and rules used to solve a problem or analyze
a set of data. It is like a flow chart, with step-by-step instructions for questions to ask,
but written in math and programming code. An algorithm may describe how to
determine whether a pet is a cat, dog, fish, bird, or lizard. Another far more complicated
algorithm may describe how to identify a written or spoken language, analyze its words,
translate them into a different language, and then check the translation for accuracy.
Machine learning
Machine learning (ML) is an AI technique that uses mathematical algorithms to create
predictive models. An algorithm is used to parse data fields and to "learn" from that
data by using patterns found within it to generate models. Those models are then used
to make informed predictions or decisions about new data.
The predictive models are validated against known data, measured by performance
metrics selected for specific business scenarios, and then adjusted as needed. This
process of learning and validation is called training. Through periodic retraining, ML
models are improved over time.
Deep learning
Deep learning is a type of ML that can determine for itself whether its predictions are
accurate. It also uses algorithms to analyze data, but it does so on a larger scale than
ML.
Deep learning uses artificial neural networks, which consist of multiple layers of
algorithms. Each layer looks at the incoming data, performs its own specialized analysis,
and produces an output that other layers can understand. This output is then passed to
the next layer, where a different algorithm does its own analysis, and so on.
With many layers in each neural network-and sometimes using multiple neural
networks-a machine can learn through its own data processing. This requires much
more data and much more computing power than ML.
Bots
A bot is an automated software program designed to perform a particular task. Think of
it as a robot without a body. Early bots were comparatively simple, handling repetitive
and voluminous tasks with relatively straightforward algorithmic logic. An example
would be web crawlers used by search engines to automatically explore and catalog
web content.
Bots have become much more sophisticated, using AI and other technologies to mimic
human activity and decision-making, often while interacting directly with humans
through text or even speech. Examples include bots that can take a dinner reservation,
chatbots (or conversational AI) that help with customer service interactions, and social
bots that post breaking news or scientific data to social media sites.
Microsoft offers the Azure Bot Service, a managed service purpose-built for enterprise-
grade bot development.
Autonomous systems
Autonomous systems are part of an evolving new class that goes beyond basic
automation. Instead of performing a specific task repeatedly with little or no variation
(like bots do), autonomous systems bring intelligence to machines so they can adapt to
changing environments to accomplish a desired goal.
Microsoft AI School
Microsoft AI Blog
Prebuilt AI
Prebuilt AI is exactly what it sounds like-off-the-shelf AI models, services, and APIs that
are ready to use. These help you add intelligence to apps, websites, and flows without
having to gather data and then build, train, and publish your own models.
You can build and train your own models, but AI Builder also provides select prebuilt AI
models that are ready for use right away. For example, you can add a component in
Microsoft Power Apps based on a prebuilt model that recognizes contact information
from business cards.
Custom AI
Although prebuilt AI is useful (and increasingly flexible), the best way to get what you
need from AI is probably to build a system yourself. This is obviously a very deep and
complex subject, but let's look at some basic concepts beyond what we've just covered.
Code languages
The core concept of AI is the use of algorithms to analyze data and generate models to
describe (or score) it in ways that are useful. Algorithms are written by developers and
data scientists (and sometimes by other algorithms) using programming code. Two of
the most popular programming languages for AI development are currently Python and
R.
PyTorch. An open-source Python library with a rich ecosystem that can be used
for deep learning, computer vision, natural language processing, and more
Microsoft has fully embraced the R programming language and provides many different
options for R developers to run their code in Azure.
Training
During the training phase, a quality set of known data is tagged so that individual fields
are identifiable. The tagged data is fed to an algorithm configured to make a particular
prediction. When finished, the algorithm outputs a model that describes the patterns it
found as a set of parameters. During validation, fresh data is tagged and used to test
the model. The algorithm is adjusted as needed and possibly put through more training.
Finally, the testing phase uses real-world data without any tags or preselected targets.
Assuming the model's results are accurate, it is considered ready for use and can be
deployed.
Hyperparameter tuning
Hyperparameters are data variables that govern the training process itself. They are
configuration variables that control how the algorithm operates. Hyperparameters are
thus typically set before model training begins and are not modified within the training
process in the way that parameters are. Hyperparameter tuning involves running trials
within the training task, assessing how well they are getting the job done, and then
adjusting as needed. This process generates multiple models, each trained using
different families of hyperparameters.
Automated machine learning, also known as AutoML, is the process of automating the
time-consuming, iterative tasks of machine learning model development. It can
significantly reduce the time it takes to get production-ready ML models. Automated
ML can assist with model selection, hyperparameter tuning, model training, and other
tasks, without requiring extensive programming or domain knowledge.
Scoring
Scoring is also called prediction and is the process of generating values based on a
trained machine learning model, given some new input data. The values, or scores, that
are created can represent predictions of future values, but they might also represent a
likely category or outcome. The scoring process can generate many different types of
values:
A probability value, indicating the likelihood that a new input belongs to some
existing category
Batch scoring is when data is collected during some fixed period of time and then
processed in a batch. This might include generating business reports or analyzing
customer loyalty.
What is Azure Machine Learning? General orientation with links to many learning
resources, SDKs, documentation, and more
Automate machine learning activities with the Azure Machine Learning CLI
Quickstart: Create an Azure Cognitive Search cognitive skill set in the Azure portal
The Microsoft Machine Learning library for Apache Spark is MMLSpark (Microsoft ML
for Apache Spark). It is an open-source library that adds many deep learning and data
science tools, networking capabilities, and production-grade performance to the Spark
ecosystem. Learn more about MMLSpark features and capabilities.
Azure HDInsight overview. Basic information about features, cluster architecture,
and use cases, with pointers to quickstarts and tutorials.
GitHub repo for MMLSpark: Microsoft Machine Learning library for Apache Spark
Databricks Runtime for Machine Learning (Databricks Runtime ML) lets you start a
Databricks cluster with all of the libraries required for distributed training. It provides a
ready-to-go environment for machine learning and data science. Plus, it contains
multiple popular libraries, including TensorFlow, PyTorch, Keras, and XGBoost. It also
supports distributed training using Horovod.
Customer stories
Different industries are applying AI in innovative and inspiring ways. Following are a
number of customer case studies and success stories:
ASOS: Online retailer solves challenges with Azure Machine Learning service
KPMG helps financial institutions save millions in compliance costs with Azure
Cognitive Services
Buncee: NYC school empowers readers of all ages and abilities with Azure AI
Zencity: Data-driven startup uses funding to help local governments support better
quality of life for residents
Bosch uses IoT innovation to drive traffic safety improvements by helping drivers
avoid serious accidents
Wix deploys smart, scalable search across 150 million websites with Azure
Cognitive Search
AXA Global P&C: Global insurance firm models complex natural disasters with
cloud-based HPC
Next steps
To learn about the artificial intelligence development products available from
Microsoft, refer to the Microsoft AI platform page.
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
This article describes how you can use Microsoft AI to improve website content tagging
accuracy by combining deep learning and natural language processing (NLP) with data
on site-specific search terms.
Architecture
Dataflow
1. Data is stored in various formats, depending on its original source. Data can be
stored as files within Azure Data Lake Storage or in tabular form in Azure Synapse
or Azure SQL Database.
2. Azure Machine Learning (ML) can connect and read from such sources, to ingest
the data into the NLP pipeline for pre-processing, model training, and post-
processing.
3. NLP pre-processing includes several steps to consume data, with the purpose of
text generalization. Once the text is broken up into sentences, NLP techniques,
such as lemmatization or stemming, allow the language to be tokenized in a
general form.
4. As NLP models are already available pre-trained, the transfer learning approach
recommends that you download language-specific embeddings and use an
industry standard model, for multi-class text classification, such as variations of
BERT .
6. The model can be deployed through Azure Kubernetes Service, while running a
Kubernetes-managed cluster where the containers are deployed from images that
are stored in Azure Container Registry. Endpoints can be made available to a front-
end application. The model can be deployed through Azure Kubernetes Service as
real-time endpoints.
7. Model results can be written to a storage option in file or tabular format, then
properly indexed by Azure Cognitive Search. The model would run as batch
inference and store the results in the respective datastore.
Components
Data Lake Storage for Big Data Analytics
Azure Machine Learning
Azure Cognitive Search
Azure Container Registry
Azure Kubernetes Service (AKS)
Scenario details
Social sites, forums, and other text-heavy Q&A services rely heavily on content tagging,
which enables good indexing and user search. Often, however, content tagging is left to
users' discretion. Because users don't have lists of commonly searched terms or a deep
understanding of the site structure, they frequently mislabel content. Mislabeled content
is difficult or impossible to find when it's needed later.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
See the product documentation:
Related resources
See the following related architectural articles:
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
This architecture shows how knowledge mining can help customer support teams
quickly find answers to customer questions or assess customer sentiment at scale.
Architecture
There are three steps in knowledge mining: ingest, enrich, and explore.
Dataflow
Ingest
The ingest step aggregates content from a range of sources, including structured and
unstructured data. For customer support and feedback analysis, you can ingest different
types of content. This content includes customer support tickets, chat logs, call
transcriptions, customer emails, customer payment history, product reviews, social
media feeds, online comments, feedback forms, and surveys.
Enrich
The enrich step uses AI capabilities to extract information, find patterns, and deepen
understanding. You can enrich content by using key phrase extraction, sentiment
analysis, language translation, bot services, custom models to focus on specific products
or company policies.
Explore
The explore step is explorer data via search, existing business applications, or analytics
solutions. For example, you can compile enriched documents in the knowledge store
and project them into tabular or object stores. The stores can be used to surface trends
in an analytics dashboard identifying frequent issues or popular products. Or, you can
integrate the search index into customer service support applications.
Components
The following key technologies are used to implement tools for technical content review
and research:
Scenario details
For many companies, customer support is costly and doesn't always operate efficiently.
Knowledge mining can help customer support teams quickly find the best answers to
customer questions or assess customer sentiment at scale.
Azure Cognitive Search is a key part of knowledge mining solutions. Azure Cognitive
Search creates a search index over aggregated and analyzed content.
With queries using the search index, companies can discover trends about what
customers are saying and use that information to improve products and services.
Next steps
To build an initial knowledge mining prototype with Azure Cognitive Search, use
the knowledge mining solution accelerator.
Explore the learning path Knowledge mining with Azure Cognitive Search.
To learn more about the components in this solution, see these resources:
Azure Cognitive Search documentation
Text analytics REST API reference - Azure Cognitive Services
What is Azure Cognitive Services Translator?
Related resources
Azure Cognitive Search
Text analytics
Large-scale custom natural
language processing
Azure Computer Vision Azure Data Lake Storage Azure Databricks Azure HDInsight
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
Implement a custom natural language processing (NLP) solution in Azure. Use Spark
NLP for tasks like topic and sentiment detection and analysis.
Apache®, Apache Spark , and the flame logo are either registered trademarks or
trademarks of the Apache Software Foundation in the United States and/or other
countries. No endorsement by The Apache Software Foundation is implied by the use of
these marks.
Architecture
Workflow
1. Azure Event Hubs, Azure Data Factory, or both services receive documents or
unstructured text data.
2. Event Hubs and Data Factory store the data in file format in Azure Data Lake
Storage. We recommend that you set up a directory structure that complies with
business requirements.
3. The Azure Computer Vision API uses its optical character recognition (OCR)
capability to consume the data. The API then writes the data to the bronze layer.
This consumption platform uses a lakehouse architecture.
4. In the bronze layer, various Spark NLP features preprocess the text. Examples
include splitting, correcting spelling, cleaning, and understanding grammar. We
recommend running document classification at the bronze layer and then writing
the results to the silver layer.
5. In the silver layer, advanced Spark NLP features perform document analysis tasks
like named entity recognition, summarization, and information retrieval. In some
architectures, the outcome is written to the gold layer.
6. In the gold layer, Spark NLP runs various linguistic visual analyses on the text data.
These analyses provide insight into language dependencies and help with the
visualization of NER labels.
7. Users query the gold layer text data as a data frame and view the results in Power
BI or web apps.
During the processing steps, Azure Databricks, Azure Synapse Analytics, and Azure
HDInsight are used with Spark NLP to provide NLP functionality.
Components
Data Lake Storage is a Hadoop-compatible file system that has an integrated
hierarchical namespace and the massive scale and economy of Azure Blob Storage.
Azure Synapse Analytics is an analytics service for data warehouses and big data
systems.
Azure Databricks is an analytics service for big data that's easy to use, facilitates
collaboration, and is based on Apache Spark. Azure Databricks is designed for data
science and data engineering.
Event Hubs ingests data streams that client applications generate. Event Hubs
stores the streaming data and preserves the sequence of received events.
Consumers can connect to hub endpoints to retrieve messages for processing.
Event Hubs integrates with Data Lake Storage, as this solution shows.
Azure HDInsight is a managed, full-spectrum, open-source analytics service in the
cloud for enterprises. You can use open-source frameworks with Azure HDInsight,
such as Hadoop, Apache Spark, Apache Hive, LLAP, Apache Kafka, Apache Storm,
and R.
Data Factory automatically moves data between storage accounts of differing
security levels to ensure separation of duties.
Computer Vision uses text recognition APIs to recognize text in images and
extract that information. The Read API uses the latest recognition models, and is
optimized for large, text-heavy documents and noisy images. The OCR API isn't
optimized for large documents but supports more languages than the Read API.
This solution uses OCR to produce data in the hOCR format.
Scenario details
Natural language processing (NLP) has many uses: sentiment analysis, topic detection,
language detection, key phrase extraction, and document categorization.
For customized NLP workloads, the open-source library Spark NLP serves as an efficient
framework for processing a large amount of text. This article presents a solution for
large-scale custom NLP in Azure. The solution uses Spark NLP features to process and
analyze text. For more information about Spark NLP, see Spark NLP functionality and
pipelines, later in this article.
Name entity extraction (NER): In Spark NLP, with a few lines of code, you can train
a NER model that uses BERT, and you can achieve state-of-the-art accuracy. NER is
a subtask of information extraction. NER locates named entities in unstructured
text and classifies them into predefined categories such as person names,
organizations, locations, medical codes, time expressions, quantities, monetary
values, and percentages. Spark NLP uses a state-of-the-art NER model with BERT.
The model is inspired by a former NER model, bidirectional LSTM-CNN. That
former model uses a novel neural network architecture that automatically detects
word-level and character-level features. For this purpose, the model uses a hybrid
bidirectional LSTM and CNN architecture, so it eliminates the need for most
feature engineering.
Sentiment and emotion detection: Spark NLP can automatically detect positive,
negative, and neutral aspects of language.
Part of speech (POS): This functionality assigns a grammatical label to each token
in input text.
Spark NLP is by far the fastest open-source NLP library. Recent public benchmarks show
Spark NLP as 38 and 80 times faster than spaCy , with comparable accuracy for
training custom models. Spark NLP is the only open-source library that can use a
distributed Spark cluster. Spark NLP is a native extension of Spark ML that operates
directly on data frames. As a result, speedups on a cluster result in another order of
magnitude of performance gain. Because every Spark NLP pipeline is a Spark ML
pipeline, Spark NLP is well-suited for building unified NLP and machine learning
pipelines such as document classification, risk prediction, and recommender pipelines.
Besides excellent performance, Spark NLP also delivers state-of-the-art accuracy for a
growing number of NLP tasks. The Spark NLP team regularly reads the latest relevant
academic papers and produces the most accurate models.
For the execution order of an NLP pipeline, Spark NLP follows the same development
concept as traditional Spark ML machine learning models. But Spark NLP applies NLP
techniques. The following diagram shows the core components of a Spark NLP pipeline.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Next steps
Spark NLP documentation:
Spark NLP
Spark NLP general documentation
Spark NLP GitHub
Spark NLP demo
Azure components:
Data in Azure Machine Learning
What is Azure HDInsight?
Data Lake Storage
Azure Synapse Analytics
Event Hubs
Azure HDInsight
Data Factory
Computer Vision APIs
Related resources
Natural language processing technology
AI enrichment with image and natural language processing in Azure Cognitive
Search
Analyze news feeds with near real-time analytics using image and natural language
processing
Suggest content tags with NLP using deep learning
Image classification with
convolutional neural networks
(CNNs)
Azure Blob Storage Azure Container Registry Azure Data Science Virtual Machines
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
Use convolutional neural networks (CNNs) to classify large volumes of images efficiently
to identify elements in images.
Architecture
Dataflow
1. Image uploads to Azure Blob Storage are ingested by Azure Machine Learning.
2. Because the solution follows a supervised learning approach and needs data
labeling to train the model, the ingested images are labeled in Machine Learning.
3. The CNN model is trained and validated in the Machine Learning notebook.
Several pre-trained image classification models are available. You can use them by
using a transfer learning approach. For information about some variants of pre-
trained CNNs, see Advancements in image classification using convolutional neural
networks . You can download these image classification models and customize
them with your labeled data.
4. After training, the model is stored in a model registry in Machine Learning.
5. The model is deployed through batch managed endpoints.
6. The model results are written to Azure Cosmos DB and consumed through the
front-end application.
Components
Blob Storage is a service that's part of Azure Storage . Blob Storage offers
optimized cloud object storage for large amounts of unstructured data.
Machine Learning is a cloud-based environment that you can use to train,
deploy, automate, manage, and track machine learning models. You can use the
models to forecast future behavior, outcomes, and trends.
Azure Cosmos DB is a globally distributed, multi-model database. With Azure
Cosmos DB, your solutions can elastically scale throughput and storage across any
number of geographic regions.
Azure Container Registry builds, stores, and manages container images and can
store containerized machine learning models.
Scenario details
With the rise of technologies such as the Internet of Things (IoT) and AI, the world is
generating large amounts of data. Extracting relevant information from the data has
become a major challenge. Image classification is a relevant solution to identifying what
an image represents. Image classification can help you categorize high volumes of
images. Convolutional neural networks (CNNs) render good performance on image
datasets. CNNs have played a major role in the development of state-of-the-art image
classification solutions.
Convolutional layers
Pooling layers
Fully connected layers
The convolutional layer is the first layer of a convolutional network. This layer can follow
another convolutional layer or pooling layers. In general, the fully connected layer is the
final layer in the network.
As the number of layers increases, the complexity of the model increases, and the model
can identify greater portions of the image. The beginning layers focus on simple
features, such as edges. As the image data advances through the layers of the CNN, the
network starts recognizing more sophisticated elements or shapes in the object. Finally,
it identifies the expected object.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributor.
Principal author:
Next steps
To learn more about Blob Storage, see Introduction to Azure Blob Storage.
To learn more about Container Registry, see Introduction to Container registries in
Azure.
To learn more about model management (MLOps), see MLOps: Model
management, deployment, lineage, and monitoring with Azure Machine Learning.
To browse an implementation of this solution idea on GitHub, see Synapse
Machine Learning .
To explore a Microsoft Learn module that includes a section on CNNs, see Train
and evaluate deep learning models.
Related resources
Visual search in retail with Azure Cosmos DB
Retail assistant with visual
capabilities
Azure App Service Bing Custom Search Bing Visual Search Azure AI Bot Service Azure AI services
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
This solution idea uses Azure services with a bot assistant to improve interactions with
customers and provide suggestions based on visual information.
Architecture
Application 8 7
3
Mobile 1 2
4
Language
Azure App Service Azure Bot Service
Understanding
Web browser
5 6
User Input
Microsoft
Azure
Dataflow
1. The user uses an application, which is hosted on Azure App Service, either via a
web browser or a mobile device.
2. App Service communicates with Azure Bot Service to facilitate the interaction
between the user and the application.
3. Bot Service uses Azure Cognitive Services Language Understanding to identify user
intents and meaning.
4. Language Understanding (LUIS) returns the identified user intent to the Azure bot.
5. The bot passes a visual context input, such as an image, to the Bing Visual Search
API.
6. The API returns output to Bot Service.
7. Optionally, the bot retrieves more information for user queries within the user's
domain by using the Bing Custom Search API.
8. The Custom Search API returns output to Bot Service.
Components
App Service provides a framework for building, deploying, and scaling web apps.
Bot Service provides an integrated development environment for bot building.
Cognitive Services consists of cloud-based services that provide AI functionality.
Azure Cognitive Service for Language is part of Cognitive Services that offers
many natural language processing services.
Conversational language understanding is a feature of Cognitive Service for
Language. This cloud-based API service offers machine-learning intelligence
capabilities for building conversational apps. You can use language understanding
(LUIS) to predict the meaning of a conversation and pull out relevant, detailed
information.
The Bing Visual Search API returns data that's related to a given image, such as
similar images, shopping sources for purchasing the item in the image, and
webpages that include the image.
The Bing Custom Search API provides a way to create tailored ad-free search
experiences for topics.
Scenario details
This solution features a bot assistant with search integration. The bot can help
customers interact with a business application. It can also provide suggestions based on
visual information.
Next steps
What is Azure Cognitive Services?
What is Language Understanding (LUIS)?
Bing Search API documentation
What is the Bing Visual Search API?
What is the Bing Custom Search API?
App Service overview
Azure Bot Service documentation
Introduction to Bot Framework Composer
Related resources
Visual assistant
Artificial intelligence (AI) - Architectural overview
Choose a Microsoft Azure Cognitive Services technology
Visual assistant
Azure App Service Azure AI Bot Service Azure AI services
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
This solution presents a visual assistant that provides rich information that's based on
the content of an image.
Architecture
Language
Azure Bot Service
Understanding
4 5 6
Bing Visual Search
Microsoft
Azure
Components
Azure App Service is a fully managed HTTP-based service for hosting web apps,
REST APIs, and mobile backends.
Azure Bot Service offers an environment for developing intelligent, enterprise-
grade bots that enrich customer experiences. The integrated environment also
provides a way to maintain control of your data.
The Bing Custom Search API provides a way to create customized search
experiences with Bing's powerful ranking and global-scale search index.
The Bing Entity Search API offers search capabilities that identify relevant
entities, such as well-known people, places, movies, TV shows, video games, books,
and businesses.
The Bing Visual Search API returns data that's related to a given image, such as
similar images, shopping sources for purchasing the item in the image, and
webpages that include the image.
The Bing Web Search API provides search results after you issue a single API call.
The results compile relevant information from billions of webpages, images,
videos, and news.
Azure Cognitive Service for Language is part of Azure Cognitive Services that
offers many natural language processing services.
Conversational language understanding is a feature of Cognitive Service for
Language. This cloud-based API service offers machine-learning intelligence
capabilities for building conversational apps. You can use LUIS to predict the
meaning of a conversation and pull out relevant, detailed information.
Scenario details
This solution presents a visual assistant that provides rich information that's based on
the content of an image. The assistant's capabilities include reading business cards,
deciphering barcodes, and recognizing well-known people, places, objects, artwork, and
monuments.
Appointment scheduling.
Order and delivery tracking in manufacturing, automotive, and transportation
applications.
Barcode purchases in retail.
Payment processing in finance and retail.
Subscription renewals in retail.
The identification of well-known people, places, objects, art, and monuments, in
the education, media, and entertainment industries.
Next steps
To design an app that detects context that matters to you, see Quickstart: Create
an object detection project with the Custom Vision client library.
To explore the search capabilities that Bing provides, see Bing family of search
APIs.
To build LUIS into your bot, see Add natural language understanding to your bot.
To explore a Learn module about how LUIS works, see Create a language model
with Conversational Language Understanding.
To learn how to build with Bot Service, see Build a bot with the Language Service
and Azure Bot Service.
To create a bot that incorporates QnA Maker and Bot Service, see Create
conversational AI solutions.
To solidify your understanding of LUIS, Bot Service, and the Bing Visual Search API,
see Exam AI-900: Microsoft Azure AI Fundamentals.
To certify your knowledge about Cognitive Services, see Microsoft Certified: Azure
AI Engineer Associate.
To learn more about the components in this solution, see these resources:
App Service overview
Azure Bot Service documentation
What is Bing Custom Search?
What is Bing Entity Search API?
What is the Bing Visual Search API?
What is the Bing Web Search API?
What is Language Understanding (LUIS)?
Related resources
Artificial intelligence (AI) - Architectural overview
Image classification on Azure
Retail assistant with visual capabilities
Vision classifier model with Azure
Custom Vision Cognitive Service
Azure GitHub
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
This architecture uses Custom Vision to classify images taken by a simulated drone. It
provides a way to combine AI and the Internet of Things (IoT). Azure Custom Vision can
also be used for object detection purpose.
Architecture
Workflow
1. Use AirSim's 3D-rendered environment to take images taken with the drone. Use
the images as the training dataset.
2. Import and tag the dataset in a Custom Vision project. The cognitive service trains
and tests the model.
3. Export the model into TensorFlow format so you can use it locally.
4. The model can also be deployed to a container or to mobile devices.
Components
TensorFlow
TensorFlow is an open-source platform for machine learning (ML). It's a tool that helps
you develop and train ML models. When you export your model to TensorFlow format,
you'll have a protocol buffer file with the Custom Vision model that you can use locally
in your script.
Scenario details
Azure Cognitive Services offers many possibilities for Artificial Intelligence (AI) solutions.
One of them is Azure Custom Vision, which allows you to build, deploy, and improve
your image classifiers. This architecture uses Custom Vision to classify images taken by a
simulated drone. It provides a way to combine AI and the Internet of Things (IoT). Azure
Custom Vision can also be used for object detection purpose.
Microsoft Search and Rescue Lab suggests a hypothetical use case for Custom Vision.
In the lab, you fly a Microsoft AirSim simulated drone around in a 3D-rendered
environment. You use the simulated drone to capture synthetic images of the animals in
that environment. After creating a dataset of images, you use the dataset to train a
Custom Vision classifier model. To train the model, you tag the images with the names
of the animals. When you fly the drone again, take new images of the animals. This
solution identifies the name of the animal in each new image.
In a practical application of the lab, an actual drone replaces the Microsoft AirSim
simulated drone. If a pet is lost, the owner provides images of the pet to the Custom
Vision model trainer. Just like in the simulation, the images are used to train the model
to recognize the pet. Then, the drone pilot searches an area where the lost pet might be.
As it finds animals along the way, the drone's camera can capture images and determine
if the animal is the lost pet.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Next steps
Learn more about Microsoft AirSim
Learn more about Azure Custom Vision Cognitive Service
Learn more about Azure Cognitive Services
Related resources
Read other Azure Architecture Center articles:
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
This solution idea identifies speech in static video files to manage speech as standard
content.
Architecture
Azure Encoder
(Standard or
Premium)
Source Azure Blob Streaming Multi-Protocol Azure CDN Azure Media player
Audio/Video files Storage endpoint Dynamic Packaging/
Multi-DRM
TTML, WebVTT
Keywords
Azure Media
Indexer/OCR Media Azure Search
processor
Web Apps
Microsoft
Azure
Components
Blob Storage is a service that's part of Azure Storage . Blob Storage offers
optimized cloud object storage for large amounts of unstructured data.
Azure Media Services is a cloud-based platform that you can use to stream
video, enhance accessibility and distribution, and analyze video content.
Live and on-demand streaming is a feature of Azure Media Services that delivers
content to various devices at scale.
Azure Encoding provides a way to convert files that contain digital video or
audio from one standard format to another.
Azure Media Player plays videos that are in various formats.
Azure Content Delivery Network offers a global solution for rapidly delivering
content. This service provides your users with fast, reliable, and secure access to
your apps' static and dynamic web content.
Azure Cognitive Search is a cloud search service that supplies infrastructure,
APIs, and tools for searching. You can use Azure Cognitive Search to build search
experiences over private, heterogeneous content in web, mobile, and enterprise
applications.
App Service provides a framework for building, deploying, and scaling web apps.
The Web Apps feature is a service for hosting web applications, REST APIs, and
mobile back ends.
Azure Media Indexer provides a way to make content of your media files
searchable. It can also generate a full-text transcript for closed captioning and
keywords.
Scenario details
A speech-to-text solution provides a way to identify speech in static video files so you
can manage it as standard content. For instance, employees can use this technology to
search within training videos for spoken words or phrases. Then they can navigate to the
specific moment in the video that contains the word or phrase.
When you use this solution, you can upload static videos to an Azure website. The Azure
Media Indexer uses the Speech API to index the speech within the videos and stores it in
an Azure database. You can search for words or phrases by using the Web Apps feature
of Azure App Service. Then you can retrieve a list of results. When you select a result,
you can see the place in the video that mentions the word or phrase.
This solution is built on the Azure managed services Content Delivery Network and
Azure Cognitive Search .
Next steps
How to use Azure Blob Storage
How to encode an asset using Media Encoder
How to manage streaming endpoints
Using Azure Content Delivery Network
Develop video player applications
Create an Azure Cognitive Search service
Run Web Apps in the cloud
Indexing media files
Related resources
Gridwich cloud media system
Live stream digital media
Video-on-demand digital media
Customer churn prediction using
real-time analytics
Azure Machine Learning
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
Customer Churn Prediction uses Azure AI platform to predict churn probability, and it
helps find patterns in existing data that are associated with the predicted churn rate.
Architecture
Dataflow
1. Use Azure Event Hubs to stream all live data into Azure.
2. Process real-time data using Azure Stream Analytics . Stream Analytics can
output processed data into Azure Synapse . This allows customers to combine
existing and historical data to create dashboards and reports in Power BI.
3. Ingest historical data at scale into Azure Blob Storage using Azure Synapse or
another ETL tool.
4. Use Azure Synapse to combine streaming data with historical data for reporting
or experimentation in Azure Machine Learning .
5. Use Azure Machine Learning to build models for predicting churn probability
and identify data patterns to deliver intelligent insights.
Components
Azure Event Hubs is an event ingestion service that can process millions of
events per second. Data sent to event hub can be transformed and stored using
any real-time analytics provider.
Azure Stream Analytics is a real-time analytics engine designed to analyze and
process high volume of fast streaming data. Relationships and patterns identified
in the data can be used to trigger actions and initiate workflows such as creating
alerts, feeding information to a reporting tool, or storing transformed data for later
use.
Azure Blob Storage is a cloud service for storing large amounts of unstructured
data such as text, binary data, audio, and documents more-easily and cost-
effectively. Azure Blob Storage allows data scientists quick access to data for
experimentation and AI model building.
Azure Synapse Analytics is a fast and reliable data warehouse with limitless
analytics that brings together data integration, enterprise data warehousing, and
big data analytics. It gives you the freedom to query data on your terms, using
either serverless or dedicated resources and serve data for immediate BI and
machine learning needs.
Azure Machine Learning can be used for any supervised and unsupervised
machine learning, whether you prefer to write Python of R code. You can build,
train, and track machine learning models in an Azure Machine Leaning workspace.
Power BI is a suite of tools that delivers powerful insights to organizations.
Power BI connects to various data sources, simplify data prep and model creation
from disparate sources. Enhance team collaboration across the organization to
produce analytical reports and dashboard to support the business decisions and
publish them to the web and mobile devices for users to consume.
Scenario details
Keeping existing customers is five times cheaper than the cost of getting new
customers. For this reason, marketing executives often find themselves trying to
estimate the likelihood of customer churn and finding the necessary actions to minimize
the churn rate.
The objective of this guide is to demonstrate predictive data pipelines for retailers to
predict customer churn. Retailers can use these predictions to prevent customer churn
by using their domain knowledge and proper marketing strategies to address at-risk
customers. The guide also shows how customer churn models can be retrained to use
more data as it becomes available.
Solution dashboard
The snapshot below shows an example Power BI dashboard that gives insights into the
predicted churn rates across a customer base.
Next steps
About Azure Event Hubs
Welcome to Azure Stream Analytics
What is Azure Synapse Analytics?
Introduction to Azure Blob Storage
What is Azure Machine Learning?
What is Power BI?
Related resources
Architecture guides:
Reference architectures:
Batch scoring for deep learning models
Batch scoring of Python models on Azure
Build a speech-to-text transcription pipeline
Personalized offers
Azure Event Hubs Azure Functions Azure Machine Learning Azure Storage Azure Stream Analytics
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
Architecture
User activity Ingest Storage Analyze Visualize
1 3 4
Aggregated
Data 6
Azure Cosmos DB Intelligent Power BI
Azure Event Hubs
(SQL API) Recommendations
11
Products Function App 2
Offers 7
Raw Stream Data
Azure Stream Azure Data Lake
Analytics Storage
References
8 9
Product Views
Microsoft
Azure
Dataflow
1. An Azure Function app captures the raw user activity (such as product and offer
clicks) and offers that are made to users on the website. The activity is sent to
Azure Event Hubs. In areas where user activity is not available, the simulated user
activity is stored in Azure Cache for Redis.
2. Azure Stream Analytics analyzes the data to provide near real-time analytics on the
input stream from the Azure Event Hubs instance.
3. The aggregated data is sent to Azure Cosmos DB for NoSQL.
4. Power BI is used to look for insights on the aggregated data.
5. The raw data is sent to Azure Data Lake Storage.
6. Intelligent Recommendations uses the raw data from Azure Data Lake Storage and
provides recommendations to Azure Personalizer.
7. The Personalizer service serves the top contextual and personalized products and
offers.
8. Simulated user activity data is provided to the Personalizer service to provide
personalized products and offers.
9. The results are provided on the web app that the user accesses.
10. User feedback is captured based on the reaction of the user to the displayed offers
and products. The reward score is provided to the Personalizer service to make it
perform better over time
11. Retraining for Intelligent Recommendations can result in better recommendations.
This process can also be done by using refreshed data from Azure Data Lake
Storage.
Components
Event Hubs is a fully managed streaming platform. In this solution, Event Hubs
collects real-time consumption data.
Stream Analytics offers real-time serverless stream processing. This service
provides a way to run queries in the cloud and on edge devices. In this solution,
Stream Analytics aggregates the streaming data and makes it available for
visualization and updates.
Azure Cosmos DB is a globally distributed, multi-model database. With Azure
Cosmos DB, your solutions can elastically scale throughput and storage across any
number of geographic regions. The Azure Cosmos DB for NoSQL stores data in
document format and is one of several database APIs that Azure Cosmos DB
offers. In the GitHub implementation of this solution, DocumentDB was used to
store the customer, product, and offer information, but you can also use Azure
Cosmos DB for NoSQL. For more information, see Dear DocumentDB customers,
welcome to Azure Cosmos DB! .
Storage is a cloud storage solution that includes object, file, disk, queue, and
table storage. Services include hybrid storage solutions and tools for transferring,
sharing, and backing up data. This solution uses Storage to manage the queues
that simulate user interaction.
Functions is a serverless compute platform that you can use to build
applications. With Functions, you can use triggers and bindings to integrate
services. This solution uses Functions to coordinate the user simulation. Functions
is also the core component that generates personalized offers.
Machine Learning is a cloud-based environment that you can use to train,
deploy, automate, manage, and track machine learning models. Here, Machine
Learning uses each user's preferences and product history to provide the user-to-
product affinity scoring.
Azure Cache for Redis provides an in-memory data store that's based on Redis
software. Azure Cache for Redis provides open-source Redis capabilities as a fully
managed offering. In this solution, Azure Cache for Redis provides pre-computed
product affinities for customers with no available user history.
Power BI is a business analytics service that provides interactive visualizations
and business intelligence capabilities. Its easy-to-use interface makes it possible
for you to create your own reports and dashboards. This solution uses Power BI to
display real-time activity in the system. For instance, Power BI uses the data from
Azure Cosmos DB for NoSQL to display the customer response to various offers.
Data Lake Storage is a scalable storage repository that holds a large amount of
data in the data's native, raw format.
Solution details
In today's highly competitive and connected environment, modern businesses can no
longer survive on generic, static online content. Furthermore, marketing strategies that
use traditional tools can be expensive and hard to implement. As a result, they don't
produce the desired return on investment. These systems often fail to take full
advantage of collected data when they create a more personalized experience for users.
Presenting offers that are customized for each user has become essential to building
customer loyalty and remaining profitable. On a retail website, customers desire
intelligent systems that provide offers and content based on their unique interests and
preferences. Today's digital marketing teams can build this intelligence by using the data
that's generated from all types of user interactions.
Marketers now have the opportunity to deliver highly relevant and personalized offers
to each user by analyzing massive amounts of data. But building a reliable and scalable
big data infrastructure isn't trivial. And developing sophisticated machine learning
models that are personalized for each user is also a complex undertaking.
Microsoft Azure provides advanced analytics tools in the areas of data ingestion, data
storage, data processing, and advanced analytics components—all the essential
elements for building a personalized offer solution.
System integrator
You can save time when you implement this solution by hiring a trained system
integrator (SI). The SI can help you develop a proof of concept and can help deploy and
integrate the solution.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Related resources
Artificial intelligence (AI) - Architectural overview
Azure Machine Learning documentation
Optimize marketing with machine
learning
Azure AI services Azure Synapse Analytics Azure Machine Learning Azure Data Lake Power BI
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
Azure services can extract insights from social media for you to use in big data
marketing campaigns.
Architecture
1 2 4 5
External data
(Text, posts) Azure Synapse Azure Data Lake
Analytics
3
Azure Machine Microsoft
Learning Power BI
Microsoft
Azure
Dataflow
1. Azure Synapse Analytics enriches data in dedicated SQL pools with the model
that's registered in Azure Machine Learning via a stored procedure.
2. Azure Cognitive Services enriches the data by running sentiment analysis,
predicting overall meaning, extracting relevant information, and applying other AI
features. Machine Learning is used to develop a machine learning model and
register the model in the Machine Learning registry.
3. Azure Data Lake Storage provides storage for the machine learning data and a
cache for training the machine learning model.
4. The Web Apps feature of Azure App Service is used to create and deploy scalable
business-critical web applications. Power BI provides an interactive dashboard with
visualizations that use data that's stored in Azure Synapse Analytics to drive
decisions on the predictions.
Components
Azure Synapse Analytics is an integrated analytics service that accelerates time
to insight across data warehouses and big data systems.
Data Lake Storage is a massively scalable and secure data lake for high-
performance analytics workloads.
App Service provides a framework for building, deploying, and scaling web apps.
The Web Apps feature is a service for hosting web applications, REST APIs, and
mobile back ends.
Power BI is a collection of analytics services and apps. You can use Power BI to
connect and display unrelated sources of data.
Scenario details
Marketing campaigns are about more than the message that you deliver. When and
how you deliver that message is just as important. Without a data-driven, analytical
approach, campaigns can easily miss opportunities or struggle to gain traction.
These days, marketing campaigns are often based on social media analysis, which has
become increasingly important for companies and organizations around the world.
Social media analysis is a powerful tool that you can use to receive instant feedback on
products and services, improve interactions with customers to increase customer
satisfaction, keep up with the competition, and more. Companies often lack efficient,
viable ways to monitor social media conversations. As a result, they miss countless
opportunities to use these insights to inform their strategies and plans.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Other contributors:
Next steps
Learn more with the following learning paths:
Related resources
Face recognition and sentiment analysis
Customer churn prediction using real-time analytics
Create personalized marketing
solutions in near real time
Azure Cosmos DB Azure Event Hubs Azure Functions Azure Machine Learning Azure Stream Analytics
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
This architecture shows how you can create a solution personalizing offers with Azure
Functions, Azure Machine Learning, and Azure Stream Analytics.
Architecture
Cold start
Product Affinity
Cosmos DB Dashboard
(Azure Services)
Microsoft
Azure
Components
Event Hubs
Azure Stream Analytics
Azure Cosmos DB
Azure Storage
Azure Functions
Azure Machine Learning
Azure Cache for Redis
Power BI
Scenario details
Personalized marketing is essential for building customer loyalty and remaining
profitable. Reaching customers and getting them to engage is harder than ever, and
generic offers are easily missed or ignored. Current marketing systems fail to take
advantage of data that can help solve this problem.
Marketers using intelligent systems and analyzing massive amounts of data can deliver
highly relevant and personalized offers to each user, cutting through the clutter and
driving engagement. For example, retailers can provide offers and content based on
each customer's unique interests, preferences and product affinity, putting products in
front of the people most likely to buy them.
This architecture shows how you can create a solution personalizing offers with Azure
Functions, Azure Machine Learning, and Azure Stream Analytics.
Next steps
See the product documentation:
Related resources
Read other Azure Architecture Center articles:
This article describes how to use Azure OpenAI Service or Azure Cognitive Search to
search documents in your enterprise data and retrieve results to provide a ChatGPT-
style question and answer experience. This solution describes two approaches:
Azure Cognitive Search approach: Use Azure Cognitive Search to search and
retrieve relevant text data based on a user query. This service supports full-text
search, semantic search, vector search, and hybrid search.
7 Note
In Azure Cognitive Search, the semantic search and vector search features are
currently in public preview.
1
Storage Function apps Azure Cache for Azure App 1
User
accounts Redis 2 Service 4
2 4
3
Vectorize Return top k Results passed
Translate Create
Extract text query matching content with prompt
(optional) embeddings
3
Download a Visio file of this architecture.
Dataflow
Documents to be ingested can come from various sources, like files on an FTP server,
email attachments, or web application attachments. These documents can be ingested
to Azure Blob Storage via services like Azure Logic Apps, Azure Functions, or Azure Data
Factory. Data Factory is optimal for transferring bulk data.
Embedding creation:
1. The document is ingested into Blob Storage, and an Azure function is triggered to
extract text from the documents.
3. If the documents are PDFs or images, an Azure function can call Azure AI
Document Intelligence to extract the text. If the document is an Excel, CSV, Word,
or text file, python code can be used to extract the text.
4. The extracted text is then chunked appropriately, and an Azure OpenAI embedding
model is used to convert each chunk to embeddings.
5. These embeddings are persisted to the vector database. This solution uses the
Enterprise tier of Azure Cache for Redis, but any vector database can be used.
2. The Azure OpenAI embedding model is used to convert the query into vector
embeddings.
3. A vector similarity search that uses this query vector in the vector database returns
the top k matching content. The matching content to be retrieved can be set
according to a threshold that’s defined by a similarity measure, like cosine
similarity.
4. The top k retrieved content and the system prompt are sent to the Azure OpenAI
language model, like GPT-3.5 Turbo or GPT-4.
5. The search results are presented as the answer to the search query that was
initiated by the user, or the search results can be used as the grounding data for a
multi-turn conversation scenario.
Architecture: Azure Cognitive Search pull
approach
Index creation Query and retrieval
Pull Query
API
1 2 1
Storage Azure Cognitive Azure App User
accounts Search Service
3 4
AI enrichment skillsets
(optional) Create system prompt Call language model
2 3
Download a Visio file of this architecture.
Index creation:
1. Azure Cognitive Search is used to create a search index of the documents in Blob
Storage. Azure Cognitive Search supports Blob Storage, so the pull model is used
to crawl the content, and the capability is implemented via indexers.
7 Note
Azure Cognitive Search supports other data sources for indexing when using
the pull model. Documents can also be indexed from multiple data sources
and consolidated into a single index.
If vector fields are added to the index schema, which loads the vector data for
indexing, vector search can be enabled by indexing that vector data. Vector data
can be generated via Azure OpenAI embeddings.
Query and retrieval:
2. The query is passed to Azure Cognitive Search via the search documents REST API.
The query type can be simple, which is optimal for full-text search, or full, which is
for advanced query constructs like regular expressions, fuzzy and wild card search,
and proximity search. If the query type is set to semantic, a semantic search is
performed on the documents, and the relevant content is retrieved. Azure
Cognitive Search also supports vector search and hybrid search, which requires the
user query to be converted to vector embeddings.
3. The retrieved content and the system prompt are sent to the Azure OpenAI
language model, like GPT-3.5 Turbo or GPT-4.
4. The search results are presented as the answer to the search query that was
initiated by the user, or the search results can be used as the grounding data for a
multi-turn conversation scenario.
Azure OpenAI
Azure AI Azure AI language model
Translator Document
1 Intelligence 2
Download a Visio file of this architecture.
Index creation:
The query and retrieval in this approach is the same as the pull approach earlier in this
article.
Components
Azure OpenAI provides REST API access to Azure OpenAI's language models
including the GPT-3, Codex, and the embedding model series for content
generation, summarization, semantic search, and natural language-to-code
translation. Access the service by using a REST API, Python SDK, or the web-based
interface in the Azure OpenAI Studio .
Azure Cognitive Search is a cloud service that provides infrastructure, APIs, and
tools for searching. Use Azure Cognitive Search to build search experiences over
private disparate content in web, mobile, and enterprise applications.
Blob Storage is the object storage solution for raw files in this scenario. Blob
Storage supports libraries for various languages, such as .NET, Node.js, and Python.
Applications can access files in Blob Storage via HTTP or HTTPS. Blob Storage
has hot, cool, and archive access tiers to support cost optimization for storing large
amounts of data.
The Enterprise tier of Azure Cache for Redis provides managed Redis Enterprise
modules, like RediSearch, RedisBloom, RedisTimeSeries, and RedisJSON. Vector
fields allow vector similarity search, which supports real-time vector indexing
(brute force algorithm (FLAT) and hierarchical navigable small world algorithm
(HNSW)), real-time vector updates, and k-nearest neighbor search. Azure Cache for
Redis brings a critical low-latency and high-throughput data storage solution to
modern applications.
Alternatives
Depending on your scenario, you can add the following workflows.
To create vectorized data, you can use any embedding model. You can also use the
Azure AI services Vision image retrieval API to vectorize images. This tool is
available in private preview.
Use the Durable Functions extension for Azure Functions as a code-first integration
tool to perform text-processing steps, like reading handwriting, text, and tables,
and processing language to extract entities on data based on the size and scale of
the workload.
You can use any database for persistent storage of the extracted embeddings,
including:
Azure SQL Database
Azure Cosmos DB
Azure Database for PostgreSQL
Azure Database for MySQL
Scenario details
Manual processing is increasingly time-consuming, error-prone, and resource-intensive
due to the sheer volume of documents. Organizations that handle huge volumes of
documents, largely unstructured data of different formats like PDF, Excel, CSV, Word,
PowerPoint, and image formats, face a significant challenge processing scanned and
handwritten documents and forms from their customers.
These documents and forms contain critical information, such as personal details,
medical history, and damage assessment reports, which must be accurately extracted
and processed.
Organizations often already have their own knowledge base of information, which can
be used for answering questions with the most appropriate answer. You can use the
services and pipelines described in these solutions to create a source for search
mechanisms of documents.
Potential use cases
This solution provides value to organizations in industries like pharmaceutical
companies and financial services. It applies to any company that has a large number of
documents with embedded information. This AI-powered end-to-end search solution
can be used to extract meaningful information from the documents based on the user
query to provide a ChatGPT-style question and answer experience.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Next steps
What is Azure AI Document Intelligence?
What is Azure OpenAI?
What is Azure Machine Learning?
Introduction to Blob Storage
What is Azure AI Language?
Introduction to Azure Data Lake Storage Gen2
Azure QnA Maker client library
Create, train, and publish your QnA Maker knowledge base
What is question answering?
Related resources
Query-based document summarization
Automate document identification, classification, and search by using Durable
Functions
Index file content and metadata by using Azure Cognitive Search
AI enrichment with image and text processing
AI at the edge with Azure Stack
Hub
Azure Container Registry Azure Kubernetes Service (AKS) Azure Machine Learning Azure Stack Hub
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
This architecture shows how you can bring your trained AI model to the edge with Azure
Stack Hub and integrate it with your applications for low-latency intelligence.
Architecture
Dataflow
1. Data is processed using Azure Data Factory, to be placed on Azure Data Lake.
2. Data from Azure Data Factory is placed into the Azure Data Lake Storage for
training.
3. Data scientists train a model using Azure Machine Learning. The model is
containerized and put into an Azure Container Registry.
4. The model is deployed to a Kubernetes cluster on Azure Stack Hub.
5. The on-premises web application can be used to score data that's provided by the
end user, to score against the model that's deployed in the Kubernetes cluster.
6. End users provide data that's scored against the model.
7. Insights and anomalies from scoring are placed into a queue.
8. A function app gets triggered once scoring information is placed in the queue.
9. A function sends compliant data and anomalies to Azure Storage.
10. Globally relevant and compliant insights are available for consumption in Power BI
and a global app.
11. Feedback loop: The model retraining can be triggered by a schedule. Data
scientists work on the optimization. The improved model is deployed and
containerized as an update to the container registry.
Components
Key technologies used to implement this architecture:
Scenario details
With the Azure AI tools, edge, and cloud platform, edge intelligence is possible. The
next generation of AI-enabled hybrid applications can run where your data lives. With
Azure Stack Hub, bring a trained AI model to the edge, integrate it with your
applications for low-latency intelligence, and continuously feedback into a refined AI
model for improved accuracy, with no tool or process changes for local applications.
This solution idea shows a connected Stack Hub scenario, where edge applications are
connected to Azure. For the disconnected-edge version of this scenario, see the article
AI at the edge - disconnected.
Next steps
Want to learn more? Check out the Introduction to Azure Stack module
Get Microsoft Certified for Azure Stack Hub with the Azure Stack Hub Operator
Associate certification
How to install the AKS Engine on Linux in Azure Stack Hub
How to install the AKS Engine on Windows in Azure Stack Hub
Deploy your ML models to an edge device with Azure Stack Edge Devices
Innovate further and deploy Azure Cognitive Services (Speech, Language, Decision,
Vision) containers to Azure Stack Hub
For more information about the featured Azure services, see the following articles and
samples:
Related resources
See the following related architectures:
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
This article outlines a solution for using edge AI when you're disconnected from the
internet. The solution uses Azure Stack Hub to move AI models to the edge.
Apache®, Apache Hadoop , Apache Spark , Apache HBase , and Apache Storm are
either registered trademarks or trademarks of the Apache Software Foundation in the
United States and/or other countries. No endorsement by the Apache Software Foundation
is implied by the use of these marks.
Architecture
Dataflow
1. Data scientists use Azure Machine Learning and an Azure HDInsight cluster to train
a machine learning model. The model is containerized and put into Azure
Container Registry.
2. The model is deployed to an Azure Kubernetes Service (AKS) cluster on Azure
Stack Hub.
3. End users provide data that's scored against the model.
4. Insights and anomalies from scoring are placed into storage for upload later.
5. Globally relevant and compliant insights are available in a global app.
6. Data scientists use scoring from the edge to improve the model.
Components
Machine Learning is a cloud-based environment that you can use to build,
deploy, and manage machine learning models. With these models, you can
forecast future behavior, outcomes, and trends.
HDInsight is a managed, full-spectrum, open-source analytics service in the
cloud for enterprises. You can use open-source frameworks with HDInsight, such as
Hadoop, Spark, HBase, and Storm.
Container Registry is a service that creates a managed registry of container
images. You can use Container Registry to build, store, and manage the images.
You can also use it to store containerized machine learning models.
AKS is a highly available, secure, and fully managed Kubernetes service. AKS
makes it easy to deploy and manage containerized applications.
Azure Virtual Machines is an infrastructure-as-a-service (IaaS) offer. You can use
Virtual Machines to deploy on-demand, scalable computing resources like
Windows and Linux virtual machines.
Azure Storage offers highly available, scalable, secure cloud storage for data,
applications, and workloads.
Azure Stack Hub is an extension of Azure that provides a way to run apps in an
on-premises environment and deliver Azure services to your datacenter.
Scenario details
With Azure AI tools and the Azure edge and cloud platform, edge intelligence is
possible. AI-enabled hybrid applications can run where your data lives, on-premises. By
using Azure Stack Hub, you can bring a trained AI model to the edge and integrate it
with your applications for low-latency intelligence. With this approach, you don't need
to make changes in tools or processes for local applications. When you use Azure Stack
Hub, you can ensure that your cloud solutions work even when you're disconnected
from the internet.
This solution is for a disconnected Azure Stack Hub scenario. Because of latency or
intermittent connectivity issues or regulations, you might not always be connected to
Azure. In disconnected scenarios, you can process data locally and aggregate it later in
Azure for further analysis. For the connected version of this scenario, see AI at the edge.
You have security or other restrictions that require you to deploy Azure Stack Hub
in an environment that isn't connected to the internet.
You want to block data (including usage data) from being sent to Azure.
You want to use Azure Stack Hub purely as a private cloud solution that's deployed
to your corporate intranet, and you aren't interested in hybrid scenarios.
Next steps
For more information about Azure Stack solutions, see the following resources:
For more information about solution components, see the following product
documentation:
Related resources
For related solutions, see the following articles:
This article describes how to use a mobile robot with a live streaming camera to
implement various use cases. The solution implements a system that runs locally on
Azure Stack Edge to ingest and process the video stream and Azure AI services that
perform object detection.
Architecture
Anomaly
Ingestion and processing Object detection Visualization
detection
Key frames 6 8
1
2
Azure AI
Video ingest Anomaly Browser
services
and process detection
container container
3
5
7
Container Key Vault Azure Arc Azure Azure Kubernetes Azure Stack Edge
Azure Registry Monitor Service
Workflow
This workflow describes how the system processes the incoming data:
1. A camera that's installed on the robot streams video in real time by using Real
Time Streaming Protocol (RTSP).
2. A container in the Kubernetes cluster on Azure Stack Edge reads the incoming
stream and splits video into separate images. An open-source software tool called
FFmpeg ingests and processes the video stream.
3. Images are stored in the local Azure Stack Edge storage account.
4. Each time a new key frame is saved in the storage account, an AI Vision container
picks it up. For information about the separation of logic into multiple containers,
see Scenario details.
5. When it loads a key frame from the storage container, the AI Vision container
sends it to Azure AI services in the cloud. This architecture uses Azure AI Vision,
which enables object detection via image analysis.
6. The results of image analysis (detected objects and a confidence rating) are sent to
the anomaly detection container.
7. The anomaly detection container stores the results of image analysis and anomaly
detection in the local Azure Stack Edge Azure SQL database for future reference.
Using a local instance of the database improves access time, which helps to
minimize delays in data access.
8. Data processing is run to detect any anomalies in the incoming real-time video
stream. If anomalies are detected, a front-end UI shows an alert.
Components
Azure Stack Edge is used to host running Azure services on-premises, close to
the location where anomaly detection occurs, which reduces latency.
Azure Kubernetes Service on Azure Stack Edge is used to run a Kubernetes cluster
of containers that contain the system's logic on Azure Stack Edge in a simple and
managed way.
Azure Arc controls the Kubernetes cluster that runs on the edge device.
Azure AI Vision is used to detect objects in key frames of the video stream.
Azure Blob Storage is used to store images of key frames that are extracted from
the video stream.
Azure SQL Edge is used to store data on the edge, close to the service that
consumes and processes it.
Scenario details
This architecture demonstrates a system that processes a real-time video stream,
compares the extracted real-time data with a set of reference data, and makes decisions
based on the results. For example, it could be used to provide scheduled inspections of
a fenced perimeter around a secured location.
The architecture uses Stack Edge to ensure that the most resource-intensive processes
are performed on-premises, close to the source of the video. This design significantly
improves the response time of the system, which is important when an immediate
response to an anomaly is critical.
Reliability
Reliability ensures your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.
One of the biggest advantages of using Azure Stack Edge is that you get fully managed
components on your on-premises hardware. All fully managed Azure components are
automatically resilient at a regional level.
In addition, running the system in a Kubernetes cluster enables you to offload the
responsibility for keeping the subsystems healthy to the Kubernetes orchestration
system.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
Microsoft Entra managed identities provide security for all components of this
architecture. Using managed identities eliminates the need to store secrets in code or
configuration files. It simplifies access control, credential management, and role
assignment.
Cost optimization
Cost optimization is about reducing unnecessary expenses and improving operational
efficiencies. For more information, see Overview of the cost optimization pillar.
To see a pricing example for this scenario, use the Azure pricing calculator . The most
expensive components in the scenario are Azure Stack Edge and Azure Kubernetes
Service. These services provide capacity for scaling the system to address increased
demand in the future.
The cost of using Azure AI services for object detection varies based on how long the
system runs. The preceding pricing example is based on a system that produces one
image per second and operates for 8 hours per day. One FPS is sufficient for this
scenario. However, if your system needs to run for longer periods of time, the cost of
using Azure AI services is higher:
Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands
placed on it by users in an efficient manner. For more information, see Performance
efficiency pillar overview.
Because the code is deployed in a Kubernetes cluster, you can take advantage of the
benefits of this powerful orchestration system. Because the various subsystems are
separated into containers, you can scale only the most demanding parts of the
application. At a basic level, with one incoming video feed, the system can contain just
one node in a cluster. This design significantly simplifies the initial configuration. As
demand for data processing grows, you can easily scale the cluster by adding nodes.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Other contributors:
Next steps
Product documentation:
Object detection
Responsible use of AI
What is Azure Stack Edge Pro 2?
Azure Kubernetes Service
Azure Arc overview
Related resources
Image classification on Azure
AI enrichment with image and
text processing
Azure App Service Azure Blob Storage Azure Cognitive Search Azure Functions
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
This article presents a solution that enriches text and image documents by using image
processing, natural language processing, and custom skills to capture domain-specific
data. Azure Cognitive Search with AI enrichment can help identify and explore relevant
content at scale. This solution uses AI enrichment to extract meaning from the original
complex, unstructured JFK Assassination Records (JFK Files) dataset.
Architecture
Text Computer
Translator
Analytics Vision
Ingestion Enrich Index Query
Documents
Custom skills
4
1 2 5 7
Document Enriched Search Web
cracking documents index application
Projections 8
6
Microsoft
Azure
Blob Storage Table Storage
Knowledge store
Dataflow
The above diagram illustrates the process of passing the unstructured JFK Files dataset
through the Azure Cognitive Search skills pipeline to produce structured, indexable data:
1. Unstructured data in Azure Blob Storage, such as documents and images, ingest
into Azure Cognitive Search.
2. The document cracking step initiates the indexing process by extracting images
and text from the data, followed by content enrichment. The enrichment steps that
occur in this process depend on the data and type of skills selected.
3. Built-in skills based on the Computer Vision and Language Service APIs enable AI
enrichments including image optical character recognition (OCR), image analysis,
text translation, entity recognition, and full-text search.
4. Custom skills support scenarios that require more complex AI models or services.
Examples include Forms Recognizer, Azure Machine Learning models, and Azure
Functions.
5. Following the enrichment process, the indexer saves the outputs into a search
index that contains the enriched and indexed documents. Full-text search and
other query forms can use this index.
6. The enriched documents can also project into a knowledge store, which
downstream apps like knowledge mining or data science can use.
7. Queries access the enriched content in the search index. The index supports
custom analyzers, fuzzy search queries, filters, and a scoring profile to tune search
relevance.
8. Any application that connects to Blob Storage or to Azure Table Storage can access
the knowledge store.
Components
Azure Cognitive Search works with other Azure components to provide this solution.
Azure Cognitive Search indexes the content and powers the user experience in this
solution. Azure Cognitive Search can apply pre-built cognitive skills to the content, and
the extensibility mechanism can add custom skills for specific enrichment
transformations.
Azure Storage
Azure Blob Storage is REST-based object storage for data that you can access from
anywhere in the world via HTTPS. You can use Blob Storage to expose data publicly to
the world or to store application data privately. Blob Storage is ideal for large amounts
of unstructured data like text or graphics.
Azure Functions
Azure Functions is a serverless compute service that lets you run small pieces of
event-triggered code without having to explicitly provision or manage infrastructure.
This solution uses an Azure Functions method to apply the CIA Cryptonyms list to the
JFK Assassination Records as a custom skill.
Scenario details
Large, unstructured datasets can include typewritten and handwritten notes, photos and
diagrams, and other unstructured data that standard search solutions can't parse. The
JFK Assassination Records contain over 34,000 pages of documents about the CIA
investigation of the 1963 JFK assassination.
The JFK Files sample project and online demo showcase a particular Azure
Cognitive Search use case. This solution idea isn't intended to be a framework or
scalable architecture for all scenarios, but to provide a general guideline and example.
The code project and demo create a public website and publicly readable storage
container for extracted images, so you shouldn't use this solution with non-public data.
AI enrichment in Azure Cognitive Search can extract and enhance searchable, indexable
text from images, blobs, and other unstructured data sources like the JFK Files. AI
enrichment uses pre-trained machine learning skill sets from the Cognitive Services
Computer Vision and Cognitive Service for Language APIs. You can also create and
attach custom skills to add special processing for domain-specific data like CIA
Cryptonyms. Azure Cognitive Search can then index and search that context.
The Azure Cognitive Search skills in this solution fall into the following categories:
Image processing. Built-in text extraction and image analysis skills include object
and face detection, tag and caption generation, and celebrity and landmark
identification. These skills create text representations of image content, which are
searchable by using the query capabilities of Azure Cognitive Search. Document
cracking is the process of extracting or creating text content from non-text sources.
Principal author:
Next steps
Learn more about this solution:
Related resources
See the related architectures and guidance:
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
This article presents a solution for real-time inferencing on Azure Kubernetes Service
(AKS).
Architecture
App developer
Machine Learning
model in containers
Kubeflow
3
Azure Container
Data scientist 2
Registry Parameter GPU-enabled
Worker nodes
server nodes Virtual Machines
Azure Blob
storage
Microsoft
Azure
Components
Blob Storage is a service that's part of Azure Storage . Blob Storage offers
optimized cloud object storage for large amounts of unstructured data.
Container Registry builds, stores, and manages container images and can store
containerized machine learning models.
AKS is a highly available, secure, and fully managed Kubernetes service. AKS
makes it easy to deploy and manage containerized applications.
Machine Learning is a cloud-based environment that you can use to train,
deploy, automate, manage, and track machine learning models. You can use the
models to forecast future behavior, outcomes, and trends.
Scenario details
AKS is useful when you need high-scale production deployments of your machine
learning models. A high-scale deployment involves a fast response time, autoscaling of
the deployed service, and logging. For more information, see Deploy a model to an
Azure Kubernetes Service cluster.
This solution uses Kubeflow to manage the deployment to AKS. The machine learning
models run on AKS clusters that are backed by GPU-enabled virtual machines (VMs).
Next steps
What is Azure Machine Learning?
Azure Kubernetes Service (AKS)
Deploy a model to an Azure Kubernetes Service cluster
Kubeflow on Azure
What is Azure Blob Storage?
Introduction to container registries in Azure
Related resources
Artificial intelligence (AI) - Architectural overview
Orchestrate MLOps by using
Azure Databricks
Azure Databricks
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
This article provides a machine learning operations (MLOps) architecture and process
that uses Azure Databricks. This process defines a standardized way to move machine
learning models and pipelines from development to production, with options to include
automated and manual processes.
Architecture
Workflow
This solution provides a robust MLOps process that uses Azure Databricks. All elements
in the architecture are pluggable, so you can integrate other Azure and third-party
services throughout the architecture as needed. This architecture and description are
adapted from the e-book The Big Book of MLOps . This e-book explores the
architecture described here in more detail.
Source control: This project's code repository organizes the notebooks, modules,
and pipelines. Data scientists create development branches to test updates and
new models. Code is developed in notebooks or in IDEs, backed by Git, with
Databricks Repos integration for syncing with your Azure Databricks workspaces.
Source control promotes machine learning pipelines from development, through
staging (for testing), to production (for deployment).
Development
In the development environment, data scientists and engineers develop machine
learning pipelines.
2. Model training and other machine learning pipelines: Machine learning pipelines
are developed as modular code in notebooks and/or IDEs. For example, the model
training pipeline reads data from the Feature Store and other Lakehouse tables.
Training and tuning log model parameters and metrics to the MLflow tracking
server. The Feature Store API logs the final model. These logs link the model, its
inputs, and the training code.
3. Commit code: To promote the machine learning workflow toward production, the
data scientist commits the code for featurization, training, and other pipelines to
source control.
Staging
In the staging environment, CI infrastructure tests changes to machine learning pipelines
in an environment that mimics production.
4. Merge request: When a merge (or pull) request is submitted against the staging
(main) branch of the project in source control, a continuous integration and
continuous delivery (CI/CD) tool like Azure DevOps runs tests.
5. Unit and CI tests: Unit tests run in CI infrastructure, and integration tests run end-
to-end workflows on Azure Databricks. If tests pass, the code changes merge.
6. Build a release branch: When machine learning engineers are ready to deploy the
updated machine learning pipelines to production, they can build a new release. A
deployment pipeline in the CI/CD tool redeploys the updated pipelines as new
workflows.
Production
7. Feature table refresh: This pipeline reads data, computes features, and writes to
Feature Store tables. It runs continuously in streaming mode, runs on a schedule,
or is triggered.
10. Model deployment: As a model enters production, it's deployed for scoring or
serving. The most common deployment modes are:
Batch or streaming scoring: For latencies of minutes or longer, batch and
streaming are the most cost-effective options. The scoring pipeline reads the
latest data from the Feature Store, loads the latest production model version
from the Model Registry, and performs inference in a Databricks job. It can
publish predictions to Lakehouse tables, a Java Database Connectivity (JDBC)
connection, flat files, message queues, or other downstream systems.
Online serving (REST APIs): For low-latency use cases, online serving is
generally necessary. MLflow can deploy models to MLflow Model Serving on
Azure Databricks, cloud provider serving systems, and other systems. In all
cases, the serving system is initialized with the latest production model from
the Model Registry. For each request, it fetches features from an online
Feature Store and makes predictions.
11. Monitoring: Continuous or periodic workflows monitor input data and model
predictions for drift, performance, and other metrics. Delta Live Tables can simplify
the automation of monitoring pipelines, storing the metrics in Lakehouse tables.
Databricks SQL, Power BI, and other tools can read from those tables to create
dashboards and alerts.
12. Retraining: This architecture supports both manual and automatic retraining.
Scheduled retraining jobs are the easiest way to keep models fresh.
Components
Data Lakehouse . A Lakehouse architecture unifies the best elements of data
lakes and data warehouses, delivering data management and performance
typically found in data warehouses with the low-cost, flexible object stores offered
by data lakes.
Delta Lake is the recommended choice for an open-source data format for a
lakehouse. Azure Databricks stores data in Data Lake Storage and provides a
high-performance query engine.
MLflow is an open-source project for managing the end-to-end machine
learning lifecycle. These are its main components:
Tracking allows you to track experiments to record and compare parameters,
metrics, and model artifacts.
Databricks Autologging extends MLflow automatic logging to track
machine learning experiments, automatically logging model parameters,
metrics, files, and lineage information.
MLFlow Model allows you to store and deploy models from any machine
learning library to various model serving and inference platforms.
Model Registry provides a centralized model store for managing model
lifecycle stage transitions from development to production.
Model Serving enables you to host MLflow models as REST endpoints.
Azure Databricks . Azure Databricks provides a managed MLflow service with
enterprise security features, high availability, and integrations with other Azure
Databricks workspace features.
Databricks Runtime for Machine Learning automates the creation of a cluster
that's optimized for machine learning, preinstalling popular machine learning
libraries like TensorFlow, PyTorch, and XGBoost in addition to Azure Databricks
for Machine Learning tools like AutoML and Feature Store clients.
Feature Store is a centralized repository of features. It enables feature sharing
and discovery, and it helps to avoid data skew between model training and
inference.
Databricks SQL. Databricks SQL provides a simple experience for SQL queries
on Lakehouse data, and for visualizations, dashboards, and alerts.
Databricks Repos provides integration with your Git provider in the Azure
Databricks workspace, simplifying collaborative development of notebooks or
code and IDE integration.
Workflows and jobs provide a way to run non-interactive code in an Azure
Databricks cluster. For machine learning, jobs provide automation for data
preparation, featurization, training, inference, and monitoring.
Alternatives
You can tailor this solution to your Azure infrastructure. Common customizations
include:
Scenario details
MLOps helps to reduce the risk of failures in machine learning and AI systems and to
improve the efficiency of collaboration and tooling. For an introduction to MLOps and
an overview of this architecture, see Architecting MLOps on the Lakehouse .
Classical machine learning, like linear models, tree-based models, and boosting.
Modern deep learning, like TensorFlow and PyTorch.
Custom analytics, like statistics, Bayesian methods, and graph analytics.
The architecture supports both small data (single machine) and large data (distributed
computing and GPU-accelerated). In each stage of the architecture, you can choose
compute resources and libraries to adapt to your data and problem dimensions.
The architecture applies to all types of industries and business use cases. Azure
Databricks customers using this and similar architectures include small and large
organizations in industries like these:
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Other contributor:
Next steps
The Big Book of MLOps
Need for Data-centric ML Platforms (introduction to MLOps)
Databricks Machine Learning in-product quickstart
10-minute tutorials: Get started with machine learning on Azure Databricks
Databricks Machine Learning documentation
Databricks Machine Learning product page and resources
MLOps on Databricks: A How-To Guide
Automating the ML Lifecycle With Databricks Machine Learning
MLOps on Azure Databricks with MLflow
Machine Learning Engineering for the Real World
Automate Your Machine Learning Pipeline
Databricks Academy
Databricks Academy GitHub project (free training)
MLOps glossary
Three Principles for Selecting Machine Learning Platforms
What is a Lakehouse?
Delta Lake home page
Ingest data into the Azure Databricks Lakehouse
Clusters
Libraries
MLflow Documentation
Azure Databricks MLflow guide
Share models across workspaces
Notebooks
Developer tools and guidance
Deploy MLflow models to online endpoints in Azure Machine Learning
Deploy to Azure Kubernetes Service (AKS)
Related resources
MLOps framework to upscale machine learning lifecycle with Azure Machine
Learning
MLOps v2
MLOps maturity model
Deploy AI and machine learning
computing on-premises and to
the edge
Azure Container Registry Azure IoT Edge Azure Machine Learning Azure Stack Edge
This reference architecture illustrates how to use Azure Stack Edge to extend rapid
machine learning inference from the cloud to on-premises or edge scenarios. Azure
Stack Hub delivers Azure capabilities such as compute, storage, networking, and
hardware-accelerated machine learning to any edge location.
Architecture
On-premises Azure
Training data
Azure Blob
storage
Azure IoT
Hub
Azure Stack
Edge
Model
Azure Machine
Learning
Genrated
model
Azure
Container
Registry
Sampled data Stored model
Download a Visio file of this architecture.
Workflow
The architecture consists of the following steps:
Azure Machine Learning. Machine Learning lets you build, train, deploy, and
manage machine learning models in a cloud-based environment. These models
can then deploy to Azure services, including (but not limited to) Azure Container
Instances, Azure Kubernetes Service (AKS), and Azure Functions.
Azure Container Registry. Container Registry is a service that creates and manages
the Docker Registry. Container Registry builds, stores, and manages Docker
container images and can store containerized machine learning models.
Azure Stack Edge. Azure Stack Edge is an edge computing device that's designed
for machine learning inference at the edge. Data is preprocessed at the edge
before transfer to Azure. Azure Stack Edge includes compute acceleration
hardware that's designed to improve performance of AI inference at the edge.
Local data. Local data references any data that's used in the training of the
machine learning model. The data can be in any local storage solution, including
Azure Arc deployments.
Components
Azure Machine Learning
Azure Container Registry
Azure Stack Edge
Azure IoT Hub
Azure Blob Storage
Scenario details
Run local, rapid machine learning inference against data as it's ingested and you
have a significant on-premises hardware footprint.
Create long-term research solutions where existing on-premises data is cleaned
and used to generate a model. The model is then used both on-premises and in
the cloud; it's retrained regularly as new data arrives.
Build software applications that need to make inferences about users, both at a
physical location and online.
Recommendations
Each IoT Edge module is a Docker container that does a specific task in an ingest,
transform, and transfer workflow. For example, an IoT Edge module can collect data
from an Azure Stack Edge local share and transform the data into a format that's ready
for machine learning. Then, the module transfers the transformed data to an Azure Stack
Edge cloud share. You can add custom or built-in modules to your IoT Edge device or
develop custom IoT Edge modules..
7 Note
IoT Edge modules are registered as Docker container images in Container Registry.
In the Azure Stack Edge resource on the Azure cloud platform, the cloud share is backed
by an Azure Blob storage account resource. All data in the cloud share will automatically
upload to the associated storage account. You can verify the data transformation and
transfer by either mounting the local or cloud share, or by traversing the Azure Storage
account.
You can use the Machine Learning command-line interface (CLI), the R SDK , the
Python SDK, designer, or Visual Studio Code to build the scripts that are required to
train your model.
After training and readying the model to deploy, you can deploy it to various Azure
services, including but not limited to:
Azure Container Registry. You can deploy the models to a private Docker Registry
such as Azure Container Registry since they are Docker container images.
Azure Container Instances. You can deploy the model's Docker container image
directly to a container group.
Azure Kubernetes Service. You can use Azure Kubernetes Service to automatically
scale the model's Docker container image for high-scale production deployments.
Azure Functions. You can package a model to run directly on a Functions instance.
Azure Machine Learning. You can use Compute instances, managed cloud-based
development workstations, for both training and inference of models. You can also
similarly deploy the model to on-premises IoT Edge and Azure Stack Edge devices.
7 Note
For this reference architecture, the model deploys to Azure Stack Edge to make the
model available for inference on-premises. The model also deploys to Container
Registry to ensure that the model is available for inference across the widest variety
of Azure services.
Additionally, Azure Stack Edge continues to transfer data to Machine Learning for
continuous retraining and improvement by using a machine learning pipeline that's
associated with the model that's already running against data stored locally.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Availability
Consider placing your Azure Stack Edge resource in the same Azure region as
other Azure services that will access it. To optimize upload performance, consider
placing your Azure Blob storage account in the region where your appliance has
the best network connection.
Consider Azure ExpressRoute for a stable, redundant connection between your
device and Azure.
Manageability
Administrators can verify that the data source from local storage has transferred to
the Azure Stack Edge resource correctly. They can verify by mounting the Server
Message Block (SMB)/Network File System (NFS) file share or connecting to the
associated Blob storage account by using Azure Storage Explorer .
Use Machine Learning datasets to reference your data in Blob storage while
training your model. Referencing storage eliminates the need to embed secrets,
data paths, or connection strings in your training scripts.
In your Machine Learning workspace, register and track ML models to track
differences between your models at different points in time. You can similarly
mirror the versioning and tracking metadata in the tags that you use for the
Docker container images that deploy to Container Registry.
DevOps
Review the MLOps lifecycle management approach for Machine Learning. For
example, use GitHub or Azure Pipelines to create a continuous integration process
that automatically trains and retrains a model. Training can be triggered either
when new data populates the dataset or a change is made to the training scripts.
The Azure Machine Learning workspace will automatically register and manage
Docker container images for machine learning models and IoT Edge modules.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
Next steps
Product documentation
Related resources
Build an enterprise-grade conversational bot
Image classification on Azure
Many models machine learning
(ML) at scale in Azure with Spark
Azure Data Factory Azure Data Lake Azure Databricks Azure Machine Learning Azure Synapse Analytics
This article describes an architecture for many models that uses Apache Spark in either
Azure Databricks or Azure Synapse Analytics. Spark is a powerful tool for the large and
complex data transformations that some solutions require.
7 Note
Use Spark versions 3.0 and later for many models applications. The data
transformation capabilities and support for Python and pandas are much better
than in earlier versions.
A companion article, Many models machine learning (ML) at scale with Azure Machine
Learning, uses Machine Learning and compute clusters.
Architecture
Dataflow
1. Data ingestion: Azure Data Factory pulls data from a source database and copies it
to Azure Data Lake Storage.
2. Model-training pipeline:
a. Prepare data: The training pipeline pulls the data from Data Lake Storage and
uses Spark to group it into datasets for training the models.
b. Train models: The pipeline trains models for all the datasets that were created
during data preparation. It uses the pandas function API to train multiple
models in parallel. After a model is trained, the pipeline registers it into Machine
Learning along with the testing metrics.
3. Model-promotion pipeline:
a. Evaluate models: The promotion pipeline evaluates the trained models before
moving them to production. A DevOps pipeline applies business logic to
determine whether a model meets the criteria for deployment. For example, it
might check that the accuracy of the testing data is over 80 percent.
b. Register models: The promotion pipeline registers the models that qualify to
the production Machine Learning workspace.
4. Model batch-scoring pipeline:
a. Prepare data: The batch-scoring pipeline pulls data from Data Lake Storage and
uses Spark to group it into datasets for scoring.
b. Score models: The pipeline uses the pandas function API to score multiple
datasets in parallel. It finds the appropriate model for each dataset in Machine
Learning by searching the model tags. Then it downloads the model and uses it
to score the dataset. It uses the Spark connector to Synapse SQL to retain the
results.
5. Real-time scoring: Azure Kubernetes Service (AKS) can do real-time scoring if
needed. Because of the large number of models, they should be loaded on
demand, not pre-loaded.
6. Results:
a. Predictions: The batch-scoring pipeline saves predictions to Synapse SQL.
b. Metrics: Power BI connects to the model predictions to retrieve and aggregate
results for presentation.
Components
Azure Machine Learning is an enterprise-grade ML service for building and
deploying models quickly. It provides users at all skill levels with a low-code
designer, automated ML (AutoML), and a hosted Jupyter notebook environment
that supports various IDEs.
Azure Synapse Analytics is an analytics service that unifies data integration,
enterprise data warehousing, and big data analytics.
Synapse SQL is a distributed query system for T-SQL that enables data
warehousing and data virtualization scenarios and extends T-SQL to address
streaming and ML scenarios. It offers both serverless and dedicated resource
models.
Azure Data Lake Storage is a massively scalable and secure storage service for
high-performance analytics workloads.
Azure Kubernetes Service (AKS) is a fully managed Kubernetes service for
deploying and managing containerized applications. AKS simplifies deployment of
a managed AKS cluster in Azure by offloading the operational overhead to Azure.
Azure DevOps is a set of developer services that provide comprehensive
application and infrastructure lifecycle management. DevOps includes work
tracking, source control, build and CI/CD, package management, and testing
solutions.
Microsoft Power BI is a collection of software services, apps, and connectors that
work together to turn unrelated sources of data into coherent, visually immersive,
and interactive insights.
Alternatives
You can use Spark in Azure Synapse instead of Spark in Azure Databricks for model
training and scoring.
The source data can come from any database.
You can use a managed online endpoint or AKS to deploy real-time inferencing.
Scenario details
Many machine learning (ML) problems are too complex for a single ML model to solve.
Whether it's predicting sales for every item of every store, or modeling maintenance for
hundreds of oil wells, having a model for each instance might improve results on many
ML problems. This many models pattern is very common across a wide variety of
industries, and applies to many real-world use cases. With the use of Azure Machine
Learning, an end-to-end many models pipeline can include model training, batch-
inferencing deployment, and real-time deployment.
A many models solution requires a different dataset for every model during training and
scoring. For instance, if the task is to predict sales for every item of every store, every
dataset will be for a unique item-store combination.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Data partitions Partitioning the data is the key to implementing the many models
pattern. If you want one model per store, a dataset comprises all the data for one
store, and there are as many datasets as there are stores. If you want to model
products by store, there will be a dataset for every combination of product and
store. Depending on the source data format, it might be easy to partition the data,
or it can require extensive data shuffling and transformation. Spark and Synapse
SQL scale very well for such tasks, while Python pandas doesn't, since it runs only
on one node and process.
Model management: The training and scoring pipelines identify and invoke the
right model for each dataset. To do this, they calculate tags that characterize the
dataset, and then use the tags to find the matching model. The tags identify the
data partition key and the model version, and might also provide other
information.
Choosing the right architecture:
Spark is appropriate when your training pipeline has complex data
transformation and grouping requirements. It provides flexible splitting and
grouping techniques to group data by combinations of characteristics, such as
product-store or location-product. The results can be placed in a Spark
DataFrame for use in subsequent steps.
When your ML training and scoring algorithms are straightforward, you might
be able to partition data with libraries such as Scikit-learn. In such cases, you
might not need Spark, so you can avoid possible complexities that can arise
when installing Azure Synapse or Azure Databricks.
When the training datasets are already created—for example, they're in
separate files or in separate rows or columns—you don’t need Spark for
complex data transformations.
The Machine Learning and compute clusters solution provides great versatility
for situations that require complex setup. For example, you can make use of a
custom Docker container, or download files, or download pre-trained models.
Computer vision and natural language processing (NLP) deep learning are
examples of applications that might require such versatility.
Spark training and scoring: When you use the Spark architecture, you can use the
Spark pandas function API for parallel training and scoring.
Separate model repos: To protect the deployed models, consider storing them in
their own repository that the training and testing pipelines don't touch.
Online inferencing: If a pipeline loads and caches all models at the start, the
models might exhaust the container's memory. Therefore, load the models on
demand in the run method, even though it might increase latency slightly.
Training scalability: By using Spark, you can train hundreds of thousands of
models in parallel. Spark spins up multiple training processes in every VM in a
cluster. Each core can run a separate process. While this means good utilization of
resources, it's important to size the cluster accurately and choose the right SKU,
especially if the training process is expensive and long running.
Implementation details: For detailed information on implementing a many models
solution, see Implement many models for ML in Azure .
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
To better understand the cost of running this scenario on Azure, use the pricing
calculator . Good starting assumptions are:
To see how pricing differs for your use case, change the variables to match your
expected data size and serving load requirements. For larger or smaller training data
sizes, increase or decrease the size of the Azure Databricks cluster. To handle more
concurrent users during model serving, increase the AKS cluster size.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
What are compute targets in Azure Machine Learning?
Azure Arc-enabled Machine Learning
Many Models Solution Accelerator
ParallelRunStep Class
pandas function APIs
Connect to storage services on Azure
What is Azure Synapse Analytics?
Deploy a model to an Azure Kubernetes Service cluster
Related resources
Analytics architecture design
Choose an analytical data store in Azure
Choose a data analytics technology in Azure
Many models machine learning (ML) at scale with Azure Machine Learning
Batch scoring of Spark models on Azure Databricks
Many models machine learning
(ML) at scale with Azure Machine
Learning
Azure Data Factory Azure Data Lake Azure Databricks Azure Machine Learning Azure Synapse Analytics
This article describes an architecture for many models that uses Machine Learning and
compute clusters. It provides great versatility for situations that require complex setup.
A companion article, Many models machine learning (ML) at scale in Azure with Spark,
uses Apache Spark in either Azure Databricks or Azure Synapse Analytics.
Architecture
Workflow
1. Data ingestion: Azure Data Factory pulls data from a source database and copies it
to Azure Data Lake Storage. It then stores it in a Machine Learning datastore as a
tabular dataset.
2. Model-training pipeline:
a. Prepare data: The training pipeline pulls the data from the datastore and
transforms it further, as needed. It also groups the data into datasets for
training the models.
b. Train models: The pipeline trains models for all the datasets that were created
during data preparation. It uses the ParallelRunStep class to train multiple
models in parallel. After a model is trained, the pipeline registers it into Machine
Learning along with the testing metrics.
3. Model-promotion pipeline:
a. Evaluate models: The promotion pipeline evaluates the trained models before
moving them to production. A DevOps pipeline applies business logic to
determine whether a model meets the criteria for deployment. For example, it
might check that the accuracy of the testing data is over 80 percent.
b. Register models: The promotion pipeline registers the models that qualify to
the production Machine Learning workspace.
4. Model batch-scoring pipeline:
a. Prepare data: The batch-scoring pipeline pulls data from the datastore and
further transforms each file as needed. It also groups the data into datasets for
scoring.
b. Score models: The pipeline uses the ParallelRunStep class to score multiple
datasets in parallel. It finds the appropriate model for each dataset in Machine
Learning by searching the model tags. Then it downloads the model and uses it
to score the dataset. It uses the DataTransferStep class to write the results back
to Azure Data Lake, and then passes predictions from Azure Data Lake to
Synapse SQL for serving.
5. Real-time scoring: Azure Kubernetes Service (AKS) can do real-time scoring if
needed. Because of the large number of models, they should be loaded on
demand, not pre-loaded.
6. Results:
a. Predictions: The batch-scoring pipeline saves predictions to Synapse SQL.
b. Metrics: Power BI connects to the model predictions to retrieve and aggregate
results for presentation.
Components
Azure Machine Learning is an enterprise-grade ML service for building and
deploying models quickly. It provides users at all skill levels with a low-code
designer, automated ML (AutoML), and a hosted Jupyter notebook environment
that supports various IDEs.
Azure Databricks is a cloud-based data-engineering tool that's based on Apache
Spark. It can process and transform massive quantities of data and explore it by
using ML models. You can write jobs in R, Python, Java, Scala, and Spark SQL.
Azure Synapse Analytics is an analytics service that unifies data integration,
enterprise data warehousing, and big data analytics.
Synapse SQL is a distributed query system for T-SQL that enables data
warehousing and data virtualization scenarios and extends T-SQL to address
streaming and ML scenarios. It offers both serverless and dedicated resource
models.
Azure Data Lake Storage is a massively scalable and secure storage service for
high-performance analytics workloads.
Azure Kubernetes Service (AKS) is a fully managed Kubernetes service for
deploying and managing containerized applications. AKS simplifies deployment of
a managed AKS cluster in Azure by offloading the operational overhead to Azure.
Azure DevOps is a set of developer services that provide comprehensive
application and infrastructure lifecycle management. DevOps includes work
tracking, source control, build and CI/CD, package management, and testing
solutions.
Microsoft Power BI is a collection of software services, apps, and connectors that
work together to turn unrelated sources of data into coherent, visually immersive,
and interactive insights.
Alternatives
The source data can come from any database.
You can use a managed online endpoint or AKS to deploy real-time inferencing.
Scenario details
Many machine learning (ML) problems are too complex for a single ML model to solve.
Whether it's predicting sales for every item of every store, or modeling maintenance for
hundreds of oil wells, having a model for each instance might improve results on many
ML problems. This many models pattern is common across a wide variety of industries,
and has many real-world use cases. With the use of Azure Machine Learning, an end-to-
end many models pipeline can include model training, batch-inferencing deployment,
and real-time deployment.
A many models solution requires a different dataset for every model during training and
scoring. For instance, if the task is to predict sales for every item of every store, every
dataset will be for a unique item-store combination.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Data partitions Partitioning the data is the key to implementing the many models
pattern. If you want one model per store, a dataset comprises all the data for one
store, and there are as many datasets as there are stores. If you want to model
products by store, there will be a dataset for every combination of product and
store. Depending on the source data format, it may be easy to partition the data,
or it might require extensive data shuffling and transformation. Spark and Synapse
SQL scale very well for such tasks, while Python pandas doesn't, since it runs only
on one node and process.
Model management: The training and scoring pipelines identify and invoke the
right model for each dataset. To do this, they calculate tags that characterize the
dataset, and then use the tags to find the matching model. The tags identify the
data partition key and the model version, and might also provide other
information.
Choosing the right architecture:
Spark is appropriate when your training pipeline has complex data
transformation and grouping requirements. It provides flexible splitting and
grouping techniques to group data by combinations of characteristics, such as
product-store or location-product. The results can be placed in a Spark
DataFrame for use in subsequent steps.
When your ML training and scoring algorithms are straightforward, you might
be able to partition data with libraries such as Scikit-learn. In such cases, you
might not need Spark, so you can avoid possible complexities that can arise
when installing Azure Synapse or Azure Databricks.
When the training datasets are already created—for example, they're in
separate files or in separate rows or columns—you don’t need Spark for
complex data transformations.
The Machine Learning and compute clusters solution provides great versatility
for situations that require complex setup. For example, you can make use of a
custom Docker container, or download files, or download pre-trained models.
Computer vision and natural language processing (NLP) deep learning are
examples of applications that might require such versatility.
Spark training and scoring: When you use the Spark architecture, you can use the
Spark pandas function API for parallel training and scoring.
Separate model repos: To protect the deployed models, consider storing them in
their own repository that the training and testing pipelines don't touch.
ParallelRunStep Class: The Python ParallelRunStep Class is a powerful option to
run many models training and inferencing. It can partition your data in a variety of
ways, and then apply your ML script on elements of the partition in parallel. Like
other forms of Machine Learning training, you can specify a custom training
environment with access to Python Package Index (PyPI) packages, or a more
advanced custom docker environment for configurations that require more than
standard PyPI. There are many CPUs and GPUs to choose from.
Online inferencing: If a pipeline loads and caches all models at the start, the
models might exhaust the container's memory. Therefore, load the models on
demand in the run method, even though it might increase latency slightly.
Implementation details: For detailed information on implementing a many models
solution, see Implement many models for ML in Azure .
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
To better understand the cost of running this scenario on Azure, use the pricing
calculator . Good starting assumptions are:
To see how pricing differs for your use case, change the variables to match your
expected data size and serving load requirements. For larger or smaller training data
sizes, increase or decrease the size of the Azure Databricks cluster. To handle more
concurrent users during model serving, increase the AKS cluster size.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Azure Arc-enabled Machine Learning
Many Models Solution Accelerator
ParallelRunStep Class
DataTransferStep Class
Connect to storage services on Azure
What is Azure Synapse Analytics?
Deploy a model to an Azure Kubernetes Service cluster
Related resources
Analytics architecture design
Choose an analytical data store in Azure
Choose a data analytics technology in Azure
Many models machine learning (ML) at scale in Azure with Spark
Azure Machine Learning
architecture
Azure Machine Learning Azure Synapse Analytics Azure Container Registry Azure Monitor Power BI
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
This architecture shows you the components used to build, deploy, and manage high-
quality models with Azure Machine Learning, a service for the end-to-end ML lifecycle.
Architecture
Logs, files and media
(unstructured)
Azure Data Azure Synapse
Power BI
Lake Storage Analytics
Ingest and store Analyze Visualize
1 2 8
Build
and 3
Business / custom apps
train
(structured)
Retrain Authenticate
4
Deploy 5
The architecture described in this article is based on Azure Machine Learning's CLI
and Python SDK v1. For more information on the new v2 SDK and CLI, see What is
CLI and SDK v2.
Dataflow
1. Bring together all your structured, unstructured, and semi-structured data (logs,
files, and media) into Azure Data Lake Storage Gen2.
2. Use Apache Spark in Azure Synapse Analytics to clean, transform, and analyze
datasets.
3. Build and train machine learning models in Azure Machine Learning.
4. Control access and authentication for data and the ML workspace with Microsoft
Entra ID and Azure Key Vault. Manage containers with Azure Container Registry.
5. Deploy the machine learning model to a container using Azure Kubernetes
Services, securing and managing the deployment with Azure VNets and Azure
Load Balancer.
6. Using log metrics and monitoring from Azure Monitor, evaluate model
performance.
7. Retrain models as necessary in Azure Machine Learning.
8. Visualize data outputs with Power BI.
Components
Azure Machine Learning is an enterprise-grade machine learning (ML) service for
the end-to-end ML lifecycle.
Azure Synapse Analytics is a unified service where you can ingest, explore,
prepare, transform, manage, and serve data for immediate BI and machine learning
needs.
Azure Data Lake Storage Gen2 is a massively scalable and secure data lake for
your high-performance analytics workloads.
Azure Container Registry is a registry of Docker and Open Container Initiative
(OCI) images, with support for all OCI artifacts. Build, store, secure, scan, replicate,
and manage container images and artifacts with a fully managed, geo-replicated
instance of OCI distribution.
Azure Kubernetes Service Azure Kubernetes Service (AKS) offers serverless
Kubernetes, an integrated continuous integration and continuous delivery (CI/CD)
experience, and enterprise-grade security and governance. Deploy and manage
containerized applications more easily with a fully managed Kubernetes service.
Azure Monitor lets you collect, analyze, and act on telemetry data from your
Azure and on-premises environments. Azure Monitor helps you maximize
performance and availability of your applications and proactively identify problems
in seconds.
Azure Key Vault safeguards cryptographic keys and other secrets used by cloud
apps and services.
Azure Load Balancer load-balances internet and private network traffic with high
performance and low latency. Load Balancer works across virtual machines, virtual
machine scale sets, and IP addresses.
Power BI is a suite of business analytics tools that deliver insights throughout
your organization. Connect to hundreds of data sources, simplify data prep, and
drive unplanned analysis. Produce beautiful reports, then publish them for your
organization to consume on the web and across mobile devices.
Scenario details
Build, deploy, and manage high-quality models with Azure Machine Learning, a service
for the end-to-end ML lifecycle. Use industry-leading MLOps (machine learning
operations), open-source interoperability, and integrated tools on a secure, trusted
platform designed for responsible machine learning (ML).
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Next steps
See documentation for the key services in this solution:
Related resources
See related guidance on the Azure Architecture Center:
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
Learn how Project Bonsai builds and deploys autonomous systems using machine
teaching, deep reinforcement learning, and simulations.
Architecture
Project Bonsai speeds the creation of AI-powered automation. Development and
deployment have three phases: Build, Train, and Deploy.
Dataflow
1. The Build phase consists of writing the machine teaching program and connecting
to a domain-specific training simulator. Simulators generate sufficient training data
for experiments and machine practice.
Subject matter experts with no AI background can express their expertise as steps,
tasks, criteria, and desired outcomes. Engineers build autonomous systems by
creating accurate, detailed models of systems and environments, and making the
systems intelligent using methods like deep learning, imitation learning, and
reinforcement learning.
2. In the Train phase, the training engine automates DRL model generation and
training by combining high-level domain models with appropriate DRL algorithms
and neural networks.
3. The Deploy phase deploys the trained brain to the target application in the cloud,
on-premises, or embedded on site. Specific SDKs and deployment APIs deploy
trained AI systems to various target applications, perform machine tuning, and
control the physical systems.
After training is complete, engineers deploy these trained agents to the real world,
where they use their knowledge to power autonomous systems.
Components
Project Bonsai simplifies machine teaching with DRL to train and deploy smart
autonomous systems.
This architecture uses the basic tier of Container Registry to store exported brains
and uploaded simulators.
Azure Container Instances runs containers on-demand in a serverless Microsoft
Azure environment. Container Instances is the fastest and simplest way to run a
container in Azure, and doesn't require you to provision virtual machines or adopt
a higher-level service.
Azure Storage is a cloud storage solution that includes object, blob, file, disk,
queue, and table storage.
This architecture uses Storage for storing uploaded simulators as ZIP files.
Scenario details
Artificial intelligence (AI) and machine learning offer unique opportunities and
challenges for automating complex industrial systems. Machine teaching is a new
paradigm for building machine learning systems that moves the focus away from
algorithms and towards successful model generation and deployment.
Machine teaching infuses subject matter expertise into automated AI system training
with deep reinforcement learning (DRL) and simulations. Abstracting away AI complexity
to focus on subject matter expertise and real-world conditions creates models that turn
automated control systems into autonomous systems.
Use machine teaching to combine human domain knowledge with AI and machine
learning.
Automate the generation and management of DRL algorithms and models.
Integrate simulations for model optimization and scalability during training.
Deploy and scale for real-world use.
Machine teaching bridges AI science and software with traditional engineering and
domain expertise. Example applications include:
Motion control
Machine calibration
Smart buildings
Industrial robotics
Process control
Teach adaptive brains with intuitive goals and learning objectives, real-time success
assessments, and automatic versioning control.
Integrate training simulations that implement real-world problems and provide
realistic feedback.
Export trained brains and deploy them on-premises, in the cloud, or to IoT Edge
devices or embedded devices.
The Bonsai platform runs on Azure and charges resource costs to your Azure
subscription.
Azure Container Registry (basic tier) for storing exported brains and uploaded
simulators.
Azure Container Instances for running simulations.
Azure Storage for storing uploaded simulators as ZIP files.
Inkling
Training engine
The training engine in Bonsai compiles machine teaching programs to automatically
generate and train AI systems. It does the following:
Just as a language compiler hides the machine code from the programmer, the training
engine hides the details of the machine learning models and DRL algorithms. As new
algorithms and network topologies are invented, the training engine can recompile the
same machine teaching programs to exploit them.
Cartpole sample
Bonsai includes two machine teaching samples, Cartpole and Moab .
The Cartpole sample has a pole attached to a cart by an unactivated joint. The cart
moves along a straight frictionless track and the pole moves forward and backward,
depending on the movements of the cart. The available sensor information includes the
cart position and velocity and pole angle and angular velocity. The supported agent
actions are to push the cart to the left or the right.
The pole starts upright, and the goal is to keep it upright as the cart moves. There is a
reward generated for every time interval that the pole remains upright. A training
episode ends when the pole is more than 15 degrees from vertical, or when the cart
moves more than a predefined number of units from the center of the track.
The sample uses Inkling language to write the machine teaching program, and the
provided Cartpole simulator to speed and improve the training.
The following Bonsai screenshot shows Cartpole training progress, with Goal
satisfaction on the y-axis and Training iterations on the x-axis. The dashboard also
shows the percentage of goal satisfaction and the total elapsed training time.
For more information about the Cartpole example, or to try it yourself, see:
Simulations are the ideal training source for DRL because they:
Simulations are available across a broad range of industries and systems such as
mechanical and electrical engineering, autonomous vehicles, security and networking,
transportation and logistics, and robotics.
The Bonsai platform includes Simulink and AnyLogic simulators. You can add others.
AirSim
Microsoft AirSim (Aerial Informatics and Robotics Simulation) is an open-source
robotics simulation platform designed to train autonomous systems. AirSim provides a
realistic simulation tool for designers and developers to generate the large amounts of
data they need for model training and debugging.
AirSim can capture data from ground vehicles, wheeled robotics, aerial drones, and even
static IoT devices, and do it without costly field operations.
AirSim works as a plug-in to the Unreal Engine editor from Epic Games, providing
control over building environments and simulating difficult-to-reproduce, real-world
events to capture meaningful data. AirSim leverages current game engine rendering,
physics, and perception computation to create an accurate, real-world simulation.
This realism, based on efficiently generated ground-truth data, enables the study and
execution of complex missions that are time-consuming or risky in the real world. For
example, AirSim provides realistic environments, vehicle dynamics, and multi-modal
sensing for researchers building autonomous vehicles. Collisions in a simulator cost
virtually nothing, yet provide actionable information to improve the design of the
system.
You can use an Azure Resource Manager (ARM) template to automatically create a
development environment, and code and debug a Python application connected to
AirSim in Visual Studio Code. For more information, see AirSim Development
Environment on Azure .
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Autonomous systems with Microsoft AI
Autonomy for industrial control systems
Innovation space: Autonomous systems (Video)
Microsoft The AI Blog
Microsoft Autonomous Systems
Bonsai documentation
Aerial Informatics and Robotics Platform (AirSim)
How Azure Machine Learning works: Architecture and concepts
Related resources
Use subject matter expertise in machine teaching and reinforcement learning
Building blocks for autonomous-driving simulation environments
Compare the machine learning products and technologies from Microsoft
Data science and machine
learning with Azure Databricks
Azure Databricks Azure Data Lake Storage Azure Kubernetes Service (AKS) Azure Machine Learning
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
This architecture shows how you can improve operations by using Azure Databricks,
Delta Lake, and MLflow for data science and machine learning. You can improve your
overall efficiency and the customer experience by developing, training, and deploying
machine learning models.
Architecture
Process Serve
4
Azure Databricks Azure Machine Azure
Learning web Kubernetes
services Service (AKS)
1 2 3
Azure Data
Lake Storage
Store
Microsoft
Azure
Dataflow
Store
Data Lake Storage stores the data in Delta Lake format. Delta Lake forms the curated
layer of the data lake. A medallion architecture organizes the data into three layers:
Process
Code from various languages, frameworks, and libraries prepares, refines, and
cleanses the raw data (1). Coding possibilities include Python, R, SQL, Spark,
Pandas, and Koalas.
Azure Databricks runs data science workloads. This platform also builds and trains
machine learning models (2). Azure Databricks uses pre-installed, optimized
libraries. Examples include scikit-learn, TensorFlow, PyTorch, and XGBoost.
MLflow tracking captures the machine learning experiments, model runs, and
results (3). When the best model is ready for production, Azure Databricks deploys
that model to the MLflow model repository. This centralized registry stores
information on production models. The registry also makes models available to
other components:
Spark and Python pipelines can ingest models. These pipelines handle batch
workloads or streaming ETL processes.
REST APIs provide access to models for many purposes. Examples include
testing and interactive scoring in mobile and web applications.
Serve
Azure Databricks can deploy models to other services, such as Machine Learning and
AKS (4).
Components
Azure Databricks is a data analytics platform. Its fully managed Spark clusters
run data science workloads. Azure Databricks also uses pre-installed, optimized
libraries to build and train machine learning models. MLflow integration with Azure
Databricks provides a way to track experiments, store models in repositories, and
make models available to other services. Azure Databricks offers scalability:
Single-node compute clusters handle small data sets and single-model runs.
For large data sets, multi-node compute clusters or graphics processing unit
(GPU) clusters are available. These clusters use libraries and frameworks like
HorovodRunner and Hyperopt for parallel-model runs.
Data Lake Storage is a scalable and secure data lake for high-performance
analytics workloads. This service manages multiple petabytes of information while
sustaining hundreds of gigabits of throughput. The data can have these
characteristics:
Be structured, semi-structured, or unstructured.
Come from multiple, heterogeneous sources like logs, files, and media.
Be static, from batches, or streaming.
Delta Lake is a storage layer that uses an open file format. This layer runs on top
of cloud storage such as Data Lake Storage. Delta Lake is optimized for
transforming and cleansing batch and streaming data. This platform supports
these features and functionality:
Data versioning and rollback.
Atomicity, consistency, isolation, and durability (ACID) transactions for reliability.
A consistent standard for data preparation, model training, and model serving.
Time travel for consistent snapshots of source data. Data scientists can train
models on the snapshots instead of creating separate copies.
MLflow is an open-source platform for the machine learning life cycle. MLflow
components monitor machine learning models during training and running. Stored
information includes code, data, configuration information, and results. MLflow
also stores models and loads them in production. Because MLflow uses open
frameworks, various services, applications, frameworks, and tools can consume the
models.
AKS is a highly available, secure, and fully managed Kubernetes service. AKS
makes it easy to deploy and manage containerized applications.
Scenario details
As your organization recognizes the power of data science and machine learning, you
can improve efficiency, enhance customer experiences, and predict changes. To achieve
these goals in business-critical use cases, you need a consistent and reliable pattern for:
Tracking experiments.
Reproducing results.
Deploying machine learning models into production.
This article outlines a solution for a consistent, reliable machine learning framework.
Azure Databricks forms the core of the architecture. The storage layer Delta Lake and
the machine learning platform MLflow also play significant roles. These components
integrate seamlessly with other services such as Azure Data Lake Storage, Azure
Machine Learning, and Azure Kubernetes Service (AKS).
Together, these services provide a solution for data science and machine learning that's:
Simple: An open data lake simplifies the architecture. The data lake contains a
curated layer, Delta Lake. That layer provides access to the data in an open-source
format.
Open: The solution supports open-source code, open standards, and open
frameworks. This approach minimizes the need for future updates. Azure
Databricks and Machine Learning natively support MLflow and Delta Lake.
Together, these components provide industry-leading machine learning operations
(MLOps), or DevOps for machine learning. A broad range of deployment tools
integrate with the solution's standardized model format.
Collaborative: Data science and MLOps teams work together with this solution.
These teams use MLflow tracking to record and query experiments. The teams also
deploy models to the central MLflow model registry. Data engineers then use
deployed models in data ingestion, extract-transform-load (ETL) processes, and
streaming pipelines.
Besides energy providers, this solution can benefit any organization that:
Next steps
AGL Energy builds a standardized platform for thousands of parallel models. The
platform provides quick and cost-effective training, deployment, and life-cycle
management for the models.
Open Grid Europe (OGE) uses artificial intelligence models to monitor gas
pipelines. OGE uses Azure Databricks and MLflow to develop the models.
Scandinavian Airlines (SAS) uses Azure Databricks during a collaborative
research phase. The airline also uses Machine Learning to develop predictive
models. By identifying patterns in the company's data, the models improve
everyday operations.
Related resources
Choose an analytical data store in Azure
Batch scoring of Spark models on Azure Databricks
Stream processing with Azure Databricks
Ingestion, ETL, and stream processing pipelines with Azure Databricks
Modern analytics architecture with Azure Databricks
Automate document
identification, classification, and
search by using Durable Functions
Azure Functions Azure App Service Azure AI services Azure Cognitive Search
This article describes an architecture for processing document files that contain multiple
documents of various types. It uses the Durable Functions extension of Azure Functions
to implement the pipelines that process the files.
Architecture
Workflow
1. The user provides a document file that the web app uploads. The file contains
multiple documents of various types. It can, for instance, be a PDF or multipage
TIFF file.
a. The document file is stored in Azure Blob Storage.
b. The web app adds a command message to a storage queue to initiate pipeline
processing.
3. The Scan activity function calls the Computer Vision Read API, passing in the
location in storage of the document to be processed. Optical character recognition
(OCR) results are returned to the orchestration to be used by subsequent activities.
4. The Classify activity function calls the document classifier service that's hosted in
an Azure Kubernetes Service (AKS) cluster. This service uses regular expression
pattern matching to identify the starting page of each known document and to
calculate how many document types are contained in the document file. The types
and page ranges of the documents are calculated and returned to the
orchestration.
7 Note
Azure doesn’t offer a service that can classify multiple document types in a
single file. This solution uses a non-Azure service that's hosted in AKS.
5. The Metadata Store activity function saves the document type and page range
information in an Azure Cosmos DB store.
6. The Indexing activity function creates a new search document in the Cognitive
Search service for each identified document type and uses the Azure Cognitive
Search libraries for .NET to include in the search document the full OCR results and
document information. A correlation ID is also added to the search document so
that the search results can be matched with the corresponding document
metadata from Azure Cosmos DB.
7. End users can search for documents by contents and metadata. Correlation IDs in
the search result set can be used to look up document records that are in Azure
Cosmos DB. The records include links to the original document file in Blob Storage.
Components
Durable Functions is an extension of Azure Functions that makes it possible for
you write stateful functions in a serverless compute environment. In this
application, it's used for managing document ingestion and workflow
orchestration. It lets you define stateful workflows by writing orchestrator functions
that adhere to the Azure Functions programming model. Behind the scenes, the
extension manages state, checkpoints, and restarts, leaving you free to focus on
the business logic.
Azure Cosmos DB is a globally distributed, multi-model database that makes it
possible for your solutions to scale throughput and storage capacity across any
number of geographic regions. Comprehensive service level agreements (SLAs)
guarantee throughput, latency, availability, and consistency.
Azure Storage is a set of massively scalable and secure cloud services for data,
apps, and workloads. It includes Blob Storage , Azure Files , Azure Table
Storage , and Azure Queue Storage .
Azure App Service provides a framework for building, deploying, and scaling
web apps. The Web Apps feature is an HTTP-based service for hosting web
applications, REST APIs, and mobile back ends. With Web Apps, you can develop in
.NET, .NET Core, Java, Ruby, Node.js, PHP, or Python. Applications easily run and
scale in Windows and Linux-based environments.
Azure Cognitive Services provides intelligent algorithms to see, hear, speak,
understand, and interpret your user needs by using natural methods of
communication.
Azure Cognitive Search provides a rich search experience over private,
heterogeneous content in web, mobile, and enterprise applications.
AKS is a highly available, secure, and fully managed Kubernetes service. AKS
makes it easy to deploy and manage containerized applications.
Alternatives
The Form Recognizer read (OCR) model is an alternative to Computer Vision Read.
This solution stores metadata in Azure Cosmos DB to facilitate global distribution.
Azure SQL Database is another option for persistent storage of document
metadata and information.
You can use other messaging platforms, including Azure Service Bus , to trigger
Durable Functions instances.
For a solution accelerator that helps in clustering and segregating data into
templates, see Azure/form-recognizer-accelerator (github.com) .
Scenario details
This article describes an architecture that uses Durable Functions to implement
automated pipelines for processing document files that contain multiple documents of
various types. The pipelines identify the documents in a document file, classify them by
type, and store information that can be used in subsequent processing.
Many companies need to manage and process document files that contain documents
that have been scanned in bulk and that can contain several different document types.
Typically the document files are PDFs or multi-page TIFF images. These files usually
originate from outside the organization, and the receiving company doesn't control the
content.
Given these constraints, organizations have been forced to build their own document
parsing solutions that can include custom technology and manual processes. A solution
can include human intervention for splitting out individual document types into their
own files and adding classifications qualifiers for each document.
Many of these custom solutions are based on the state machine workflow pattern and
use database systems for persisting workflow state, with polling services that check for
the states that they're responsible for processing. Maintaining and enhancing such
solutions can be difficult and time consuming.
Organizations are looking for reliable, scalable, and resilient solutions for processing and
managing document identification and classification for the types of files their
organization uses. This includes processing millions of documents per day with full
observability into the success or failure of the processing pipeline.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework
Reliability
Reliability ensures that your application can meet the commitments that you make to
your customers. For more information, see Overview of the reliability pillar.
A reliable workload is one that's both resilient and available. Resiliency is the ability of
the system to recover from failures and continue to function. The goal of resiliency is to
return the application to a fully functioning state after a failure occurs. Availability is a
measure of whether your users can access your workload when they need to.
For reliability information about solution components, see the following resources:
Cost optimization
Cost optimization is about reducing unnecessary expenses and improving operational
efficiencies. For more information, see Overview of the cost optimization pillar.
The most significant costs for this architecture will potentially come from the storage of
image files in the storage account, Cognitive Services image processing, and index
capacity requirements in the Azure Cognitive Search service.
Costs can be optimized by right sizing the storage account by using reserved capacity
and lifecycle policies, proper Azure Cognitive Search planning for regional deployments
and operational scale up scheduling, and using commitment tier pricing that's available
for the Computer Vision – OCR service to manage predictable costs.
Use the pay-as-you-go strategy for your architecture and scale out as needed
rather than investing in large-scale resources at the start.
Consider opportunity costs in your architecture, and the balance between first-
mover advantage versus fast follow. Use the pricing calculator to estimate the
initial cost and operational costs.
Establish policies, budgets, and controls that set cost limits for your solution.
Performance efficiency
Performance efficiency is the ability of your workload to scale in an efficient manner to
meet the demands that users place on it. For more information, see Performance
efficiency pillar overview.
Periods when this solution processes high volumes can expose performance bottlenecks.
Make sure that you understand and plan for the scaling options for Azure Functions,
Cognitive Services autoscaling, and Azure Cosmos DB partitioning to ensure proper
performance efficiency for your solution.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Next steps
Introductory articles:
Product documentation:
Azure documentation (all products)
Durable Functions documentation
Azure Cognitive Services documentation
Azure Cognitive Search documentation
Related resources
Custom document processing models on Azure
Automate document processing by using Azure Form Recognizer
Image classification on Azure
Automate document processing
by using Azure Form Recognizer
Azure Cognitive Search Azure AI services Azure Cosmos DB Azure Document Intelligence
This article outlines a scalable and secure solution for building an automated document
processing pipeline. The solution uses Azure Form Recognizer for the structured
extraction of data. Natural language processing (NLP) models and custom models enrich
the data.
Architecture
Other sources of data Extraction Enrichment Analytics and visualizations
Data to be Azure
processed Trigger Cognitive
4 Service for
Attachments 1 Language
(email or social media apps) Azure Blob Storage
Real-time scoring
Back-end
application 2 Power BI
FTP servers
6
Azure Kubernetes
Azure Functions Service (AKS)
5
Azure Web Azure Form
Application Recognizer Other
Firewall 3 Batch scoring Azure Machine Learning applications
1
2
Azure Cosmos
DB Azure Cognitive Search
1 3
File ingestion Browser Azure Back-end
Application application
Gateway
Microsoft
Azure
Dataflow
The following sections describe the various stages of the data extraction process.
2. The back-end application posts a request to a Form Recognizer REST API endpoint
that uses one of these models:
Layout
Invoice
Receipt
ID document
Business card
General document, which is in preview
The response from Form Recognizer contains raw OCR data and structured
extractions. Form Recognizer also assigns [confidence values][Characteristics and
limitations of Form Recognizer - Customer evaluation] to the extracted data.
3. The App Service back-end application uses the confidence values to check the
extraction quality. If the quality is below a specified threshold, the app flags the
data for manual verification. When the extraction quality meets requirements, the
data enters Azure Cosmos DB for downstream application consumption. The app
can also return the results to the front-end browser.
4. Other sources provide images, PDF files, and other documents. Sources include
email attachments and File Transfer Protocol (FTP) servers. Tools like Azure Data
Factory and AzCopy transfer these files to Azure Blob Storage. Azure Logic Apps
offers pipelines for automatically extracting attachments from emails.
Data enrichment
The pipeline that's used for data enrichment depends on the use case.
Receives responses from the Azure Cognitive Service for Language API.
2. Custom models perform fraud detection, risk analysis, and other types of analysis
on the data:
Azure Machine Learning services train and deploy the custom models.
The extracted data is retrieved from Azure Cosmos DB.
The models derive insights from the data.
1. Applications use the raw OCR, structured data from Form Recognizer endpoints,
and the enriched data from NLP:
Components
App Service is a platform as a service (PaaS) offering on Azure. You can use App
Service to host web applications that you can scale in or scale out manually or
automatically. The service supports various languages and frameworks, such as
ASP.NET, ASP.NET Core, Java, Ruby, Node.js, PHP, and Python.
Azure Functions is a serverless compute platform that you can use to build
applications. With Functions, you can use triggers and bindings to react to changes
in Azure services like Blob Storage and Azure Cosmos DB. Functions can run
scheduled tasks, process data in real time, and process messaging queues.
Azure Storage is a cloud storage solution that includes object, blob, file, disk,
queue, and table storage.
Blob Storage is a service that's part of Azure Storage. Blob Storage offers
optimized cloud object storage for large amounts of unstructured data.
Azure Data Lake Storage is a scalable, secure data lake for high-performance
analytics workloads. The data typically comes from multiple heterogeneous
sources and can be structured, semi-structured, or unstructured. Azure Data Lake
Storage Gen2 combines Azure Data Lake Storage Gen1 capabilities with Blob
Storage. As a next-generation solution, Data Lake Storage Gen2 provides file
system semantics, file-level security, and scale. But it also offers the tiered storage,
high availability, and disaster recovery capabilities of Blob Storage.
Azure Cognitive Service for Language offers many NLP services that you can use
to understand and analyze text. Some of these services are customizable, such as
custom NER, custom text classification, conversational language understanding,
and question answering.
Machine Learning is an open platform for managing the development and
deployment of machine-learning models at scale. Machine Learning caters to skill
levels of different users, such as data scientists or business analysts. The platform
supports commonly used open frameworks and offers automated featurization
and algorithm selection. You can deploy models to various targets. Examples
include AKS, Azure Container Instances as a web service for real-time inferencing
at scale, and Azure Virtual Machine for batch scoring. Managed endpoints in
Machine Learning abstract the required infrastructure for real-time or batch model
inferencing.
AKS is a fully managed Kubernetes service that makes it easy to deploy and
manage containerized applications. AKS offers serverless Kubernetes technology,
an integrated continuous integration and continuous delivery (CI/CD) experience,
and enterprise-grade security and governance.
Alternatives
You can use Azure Virtual Machines instead of App Service to host your
application.
You can use any relational database for persistent storage of the extracted data,
including:
Azure SQL Database .
Azure Database for PostgreSQL .
Azure Database for MySQL .
Scenario details
Automating document processing and data extraction is an integral task in
organizations across all industry verticals. AI is one of the proven solutions in this
process, although achieving 100 percent accuracy is a distant reality. But, using AI for
digitization instead of purely manual processes can reduce manual effort by up to 90
percent.
Optical character recognition (OCR) can extract content from images and PDF files,
which make up most of the documents that organizations use. This process uses key
word search and regular expression matching. These mechanisms extract relevant data
from full text and then create structured output. This approach has drawbacks. Revising
the post-extraction process to meet changing document formats requires extensive
maintenance effort.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Availability
The availability of the architecture depends on the Azure services that make up the
solution:
Blob Storage offers redundancy options that help ensure high availability. You can
use either of these approaches to replicate data three times in a primary region:
At a single physical location for locally redundant storage (LRS).
Across three availability zones that use differing availability parameters. For
more information, see Durability and availability parameters. This option works
best for applications that require high availability.
For the availability guarantees of other Azure services in the solution, see these
resources:
SLA for App Service
SLA for Azure Functions
SLA for Application Gateway
SLA for Azure Kubernetes Service (AKS)
Scalability
App Service can automatically scale out and in as the application load varies. For
more information, see Create an autoscale setting for Azure resources based on
performance data or a schedule.
Azure Functions can scale automatically or manually. The hosting plan that you
choose determines the scaling behavior of your function apps. For more
information, see Azure Functions hosting options.
By default, Form Recognizer supports 15 concurrent requests per second. You can
increase this value by creating an Azure support ticket with a quota increase
request.
For custom models that you host as web services on AKS, azureml-fe automatically
scales as needed. This front-end component routes incoming inference requests to
deployed services.
For Azure Cognitive Service for Language, data and rate limits apply. For more
information, see these resources:
How to use named entity recognition (NER)
How to detect and redact personal information
How to use sentiment analysis and opinion mining
How to use Text Analytics for health
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
Azure Web Application Firewall helps protect your application from common
vulnerabilities. This Application Gateway option uses Open Web Application
Security Project (OWASP) rules to prevent attacks like cross-site scripting, session
hijacks, and other exploits.
Blob Storage and Azure Cosmos DB encrypt data at rest. You can secure these
services by using service endpoints or private endpoints.
You can configure Form Recognizer and Azure Cognitive Service for Language for
access from specific virtual networks or from private endpoints. These services
encrypt data at rest. You can use subscription keys, tokens, or Microsoft Entra ID to
authenticate requests to these services. For more information, see Authenticate
requests to Azure Cognitive Services.
Resiliency
The solution's resiliency depends on the failure modes of individual services like
App Service, Functions, Azure Cosmos DB, Storage, and Application Gateway. For
more information, see Resiliency checklist for specific Azure services.
You can make Form Recognizer resilient. Possibilities include designing it to fail
over to another region and splitting the workload into two or more regions. For
more information, see Back up and recover your Form Recognizer models.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
The cost of implementing this solution depends on which components you use and
which options you choose for each component.
After deciding on a pricing tier for each component, use the Azure Pricing calculator
to estimate the solution cost.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
What is Azure Form Recognizer?
[Get started: Document Intelligence Studio Studio][Get started: Document
Intelligence Studio Studio]
Use Form Recognizer SDKs or REST API
What is Azure Cognitive Service for Language?
What is Azure Machine Learning?
Introduction to Azure Functions
How to configure Azure Functions with a virtual network
What is Azure Application Gateway?
What is Azure Web Application Firewall on Azure Application Gateway?
Tutorial: How to access on-premises SQL Server from Data Factory Managed VNet
using Private Endpoint
Azure Storage documentation
Related resources
Extract text from objects using Power Automate and AI Builder
[Knowledge mining in business process management][Knowledge mining in
business process management]
[Knowledge mining in contract management][Knowledge mining in contract
management]
Knowledge mining for content research
Automate PDF forms processing
Azure Document Intelligence Azure AI services Azure Logic Apps Azure Functions
This article describes an Azure architecture that you can use to replace costly and
inflexible forms processing methods with cost-effective and flexible automated PDF
processing.
Architecture
Workflow
1. A designated Outlook email account receives PDF files as attachments. The arrival
of an email triggers a logic app to process the email. The logic app is built by using
the capabilities of Azure Logic Apps.
2. The logic app uploads the PDF files to a container in Azure Data Lake Storage.
3. You can also manually or programmatically upload PDF files to the same PDF
container.
4. The arrival of a PDF file in the PDF container triggers another logic app to process
the PDF forms that are in the PDF file.
5. The logic app sends the location of the PDF file to a function app for processing.
The function app is built by using the capabilities of Azure Functions.
6. The function app receives the location of the file and takes these actions:
a. It splits the file into single pages if the file has multiple pages. Each page
contains one independent form. Split files are saved to a second container in
Data Lake Storage.
b. It uses HTTPS POST, an Azure REST API, to send the location of the single-page
PDF file to Azure Form Recognizer for processing. When Form Recognizer
completes its processing, it sends a response back to the function app, which
places the information into a data structure.
c. It creates a JSON data file that contains the response data and stores the file to
a third container in Data Lake Storage.
7. The forms processing logic app receives the processed response data.
8. The forms processing logic app sends the processed data to Azure Cosmos DB,
which saves the data in a database and in collections.
9. Power BI obtains the data from Azure Cosmos DB and provides insights and
dashboards.
10. You can implement further processing as needed on the data that's in Azure
Cosmos DB.
Components
Azure Applied AI Services is a category of Azure AI products that use Azure
Cognitive Services, task-specific AI, and business logic to provide turnkey AI
services for common business processes. One of these products is Form
Recognizer , which uses machine learning models to extract key-value pairs, text,
and tables from documents.
Azure Logic Apps is a serverless cloud service for creating and running
automated workflows that integrate apps, data, services, and systems.
Azure Functions is a serverless solution that makes it possible for you to write
less code, maintain less infrastructure, and save on costs.
Azure Data Lake Storage is the foundation for building enterprise data lakes on
Azure.
Azure Cosmos DB is a fully managed NoSQL and relational database for modern
app development.
Power BI is a collection of software services, apps, and connectors that work
together so that you can turn your unrelated sources of data into coherent, visually
immersive, and interactive insights.
Alternatives
You can use Azure SQL Database instead of Azure Cosmos DB to store the
processed forms data.
You can use Azure Data Explorer to visualize the processed forms data that's
stored in Data Lake Storage.
Scenario details
Forms processing is often a critical business function. Many companies still rely on
manual processes that are costly, time consuming, and prone to error. Replacing manual
processes reduces cost and risk and makes a company more agile.
This article describes an architecture that you can use to replace manual PDF forms
processing or costly legacy systems that automate PDF forms processing. Form
Recognizer processes the PDF forms, Logic Apps provides the workflow, and Functions
provides data processing capabilities.
Invoices
Payment records
Safety records
Incident records
Compliance records
Purchase orders
Payment authorization forms
Health screening forms
Survey forms
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework, a
set of guiding tenets that you can use to improve the quality of a workload. For more
information, see Microsoft Azure Well-Architected Framework.
Reliability
Reliability ensures that your application can meet the commitments that you make to
your customers. For more information, see Overview of the reliability pillar.
A reliable workload is one that's both resilient and available. Resiliency is the ability of
the system to recover from failures and continue to function. The goal of resiliency is to
return the application to a fully functioning state after a failure occurs. Availability is a
measure of whether your users can access your workload when they need to.
This architecture is intended as a starter architecture that you can quickly deploy and
prototype to provide a business solution. If your prototype is a success, you can then
extend and enhance the architecture, if necessary, to meet additional requirements.
This architecture utilizes scalable and resilient Azure infrastructure and technologies. For
example, Azure Cosmos DB has built-in redundancy and global coverage that you can
configure to meet your needs.
For the availability guarantees of the Azure services that this solution uses, see Service
Level Agreements (SLA) for Online Services .
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
The Outlook email account that's used in this architecture is a dedicated email account
that receives PDF forms as attachments. It's good practice to limit the senders to trusted
parties only and to prevent malicious actors from spamming the email account.
The implementation of this architecture that's described in Deploy this scenario takes
the following measures to increase security:
The PowerShell and Bicep deployment scripts use Azure Key Vault to store sensitive
information so that it isn't displayed on terminal screens or stored in deployment
logs.
Managed identities provide an automatically managed identity in Microsoft Entra
ID for applications to use when they connect to resources that support Microsoft
Entra authentication. The function app uses managed identities so that the code
doesn't depend on individual principals and doesn't contain sensitive identity
information.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and to
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
Here are some guidelines for optimizing costs:
Use the pay-as-you-go strategy for your architecture, and scale out as needed
rather than investing in large-scale resources at the start.
The implementation of the architecture that's described in Deploy this scenario
deploys a starting solution that's suitable for proof of concept. The deployment
scripts create a working architecture with minimal resource requirements. For
example, the deployment scripts create a smallest serverless Linux host to run the
function app.
Performance efficiency
Performance efficiency is the ability of your workload to scale in an efficient manner to
meet the demands that are placed on it by users. For more information, see
Performance efficiency pillar overview.
This architecture uses services that have built-in scaling capabilities that you can use to
improve performance efficiency. Here are some examples:
You can host both Azure Logic Apps and Azure Functions in a serverless
infrastructure. For more information, see Azure serverless overview: Create cloud-
based apps and solutions with Azure Logic Apps and Azure Functions.
You can configure Azure Cosmos DB to automatically scale its throughput. For
more information, see Provision autoscale throughput on a database or container
in Azure Cosmos DB - API for NoSQL.
The accelerator receives the PDF forms, extracts the data fields, and saves the data in
Azure Cosmos DB. Power BI visualizes the data. The design uses a modular, metadata-
driven methodology. No form fields are hard-coded. It can process any PDF forms.
You can use the accelerator as is, without code modification, to process and visualize
any single-page PDF forms such as safety forms, invoices, incident records, and many
others. To use it, you only need to collect sample PDF forms, train a new model to learn
the layout of the forms, and plug the model into the solution. You also need to redesign
the Power BI report for your datasets so that it provides the insights that you want.
The implementation uses Form Recognizer Studio to create custom models. The
accelerator uses the field names that are saved in the machine learning model as a
reference to process other forms. Only five sample forms are needed to create a
custom-built machine learning model. You can merge as many as 100 custom-built
models to create a composite machine learning model that can process a variety of
forms.
Deployment repository
The GitHub repository for the solution accelerator is:
https://github.com/microsoft/Azure-PDF-Form-Processing-Automation-Solution-
Accelerator
The readme file that's displayed at that location provides an overview of the accelerator.
The deployment files are in the top-level Deployment folder of the repository:
https://github.com/microsoft/Azure-PDF-Form-Processing-Automation-Solution-
Accelerator/tree/main/Deployment
The readme file that's displayed at that location is the deployment guide. You deploy by
following the steps.
Step 2 provides details about using sample PDF forms to create a custom-built machine
learning model. You plug the model into the solution by setting the environment
variable called CUSTOM_BUILT_MODEL_ID to the machine model name in the function
app. For more information, see step 3.
Deployment prerequisites
To deploy, you need an Azure subscription. For information about free subscriptions, see
Build in the cloud with an Azure free account .
To learn about the services that are used in the accelerator, see the overview and
reference articles that are listed in:
Deployment considerations
To process a new type of PDF form, you use sample PDF files to create a new machine
learning model. When the model is ready, you plug the model ID into the solution.
This container name is configurable in the deployment scripts that you get from the
GitHub repository.
The architecture doesn't address any high availability (HA) or disaster recovery (DR)
requirements. If you want to extend and enhance the current architecture for production
deployment, consider the following recommendations and best practices:
Design the HA/DR architecture based on your requirements and use the built-in
redundancy capabilities where applicable.
Update the Bicep deployment code to create a computing environment that can
handle your processing volumes.
Update the Bicep deployment code to create more instances of the architecture
components to satisfy your HA/DR requirements.
Follow the guidelines in Azure Storage redundancy when you design and provision
storage.
Follow the guidelines in Business continuity and disaster recovery when you design
and provision the logic apps.
Follow the guidelines in Reliability in Azure Functions when you design and
provision the function app.
Follow the guidelines in Achieve high availability with Azure Cosmos DB when you
design and provision a database that was created by using Azure Cosmos DB.
If you consider putting this system into production to process large volumes of
PDF forms, you can modify the deployment scripts to create a Linux Host that has
more resources. To do so, modify the code inside
https://github.com/microsoft/Azure-PDF-Form-Processing-Automation-Solution-
Accelerator/blob/main/Deployment/1_deployment_scripts/deploy-
functionsapp.bicep
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Gail Zhou | Sr. Architect
Other contributors:
Next steps
Video: Azure PDF Form Processing Automation SA .
Azure PDF Form Processing Automation Solution Accelerator
Azure invoice Process Automation Solution Accelerator
Business Process Automation Accelerator
Tutorial: Create workflows that process emails using Azure Logic Apps, Azure
Functions, and Azure Storage
Related resources
Custom document processing models on Azure
Index file content and metadata by using Azure Cognitive Search
Automate document identification, classification, and search by using Durable
Functions
Automate document processing by using Azure Form Recognizer
Custom document processing
models on Azure
Azure Document Intelligence Azure AI services Azure Logic Apps Azure Machine Learning Studio
Azure Storage
This article describes Azure solutions for building, training, deploying, and using custom
document processing models. These Azure services also offer user interface (UI)
capabilities to do labeling or tagging for text processing.
Architecture
Data ingestion Labeling, tagging,
Source Data store Deployment
and orchestration and training
Built-in
deployment
1 2 3
4
Language Cognitive Service for Language
FTP server Data Factory Studio (custom model parameters)
Data Lake
Storage
Machine Learning Kubernetes Batch/online
Web Apps Function Apps managed
Studio Services
endpoints
Microsoft
Azure
Dataflow
1. Orchestrators like Azure Logic Apps, Azure Data Factory, or Azure Functions ingest
messages and attachments from email servers, and files from FTP servers or web
applications.
Azure Functions and Logic Apps enable serverless workloads. The service you
choose depends on your preference for service capabilities like development,
connectors, management, and execution context. For more information, see
Compare Azure Functions and Azure Logic Apps.
3. Form Recognizer Studio, Language Studio, or Azure Machine Learning studio label
and tag textual data and build the custom models. You can use these three services
independently or in various combinations to address different use cases.
Azure Machine Learning studio can also do labeling for text classification or
entity extraction with open-source frameworks like PyTorch or TensorFlow.
Form Recognizer has built-in model deployment. Use Form Recognizer SDKs
or the REST API to apply custom models for inferencing. Include the model
ID or custom model name in the Form Recognizer request URL,
depending on the API version. Form Recognizer doesn't require any further
deployment steps.
Azure Machine Learning can deploy custom models to online or batch Azure
Machine Learning managed endpoints. You can also deploy to Azure
Kubernetes Service (AKS) as a web service by using the Azure Machine
Learning SDK.
Components
Logic Apps is part of Azure Integration Services . Logic Apps creates
automated workflows that integrate apps, data, services, and systems. With
managed connectors for services like Azure Storage and Office 365, you can
trigger workflows when a file lands in the storage account or email is received.
Data Factory is a managed cloud extract, transform, load (ETL) service for data
integration and transformation. Data Factory can add transformation activities to a
pipeline that include invoking a REST endpoint or running a notebook on the
ingested data.
Blob Storage is the object storage solution for raw files in this scenario. Blob
Storage supports libraries for multiple languages, such as .NET, Node.js, and
Python. Applications can access files on Blob Storage via HTTP/HTTPS. Blob
Storage has hot, cool, and archive access tiers to support cost optimization for
storing large amounts of data.
Data Lake Storage is a set of capabilities built on Azure Blob Storage for big data
analytics. Data Lake Storage retains the cost effectiveness of Blob Storage, and
provides features like file-level security and file system semantics with hierarchical
namespace.
Azure Cognitive Service for Language consolidates the Azure natural language
processing services. The suite offers prebuilt and customizable options. For more
information, see the Cognitive Service for Language available features.
Alternatives
You can add more workflows to this scenario based on specific use cases.
If the document is in image or PDF format, you can extract the data by using Azure
Computer Vision, Form Recognizer Read API, or open-source libraries.
Use pre-processing code to do text processing steps like cleaning, stop words
removal, lemmatization, stemming, and text summarization on extracted data, per
document processing requirements. You can expose the code as REST APIs for
automation. Do these steps manually or automate them by integrating with the
Logic Apps or Azure Functions ingestion process.
Scenario details
Document processing is a broad area. It can be difficult to meet all your document
processing needs with the prebuilt models available in Azure Form Recognizer and
Azure Cognitive Service for Language. You might need to build custom models to
automate document processing for different applications and domains.
Labeling or tagging text data with relevant key-value pair entities to classify text
for extraction.
Deploying models securely at scale for easy integration with consuming
applications.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
For this example workload, implementing each pillar depends on optimally configuring
and using each component Azure service.
Reliability
Reliability ensures your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.
Availability
See the availability service level agreements (SLAs) for each component Azure
service:
Azure Form Recognizer - SLA for Azure Applied AI Services .
Azure Cognitive Service for Language - SLA for Azure Cognitive Services .
Azure Functions - SLA for Azure Functions .
Azure Kubernetes Service - SLA for Azure Kubernetes Service (AKS) .
Azure Storage - SLA for Storage Accounts .
Resiliency
Handle failure modes of individual services like Azure Functions and Azure Storage
to ensure resiliency of the compute services and data stores in this scenario. For
more information, see Resiliency checklist for specific Azure services.
For Form Recognizer, back up and recover your Form Recognizer models.
For custom text classification with Cognitive Services for Language, back up and
recover your custom text classification models.
For custom NER in Cognitive Services for Language, back up and recover your
custom NER models.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
Implement data protection, identity and access management, and network security
recommendations for Blob Storage, Cognitive Services for Form Recognizer and
Language Studio, and Azure Machine Learning.
Azure Functions can access resources in a virtual network through virtual network
integration.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
The total cost of implementing this solution depends on the pricing of the services you
choose.
The compute cost involved in Azure Machine Learning training. Choose the right
node type, cluster size, and number of nodes to help optimize costs. Azure
Machine Learning provides options to set the minimum nodes to zero and to set
the idle time before the scale down. For more information, see Manage and
optimize Azure Machine Learning costs.
Data orchestration duration and activities. For Azure Data Factory, the charges for
copy activities on the Azure integration runtime are based on the number of Data
Integration Units (DIUs) used and the execution duration. Added orchestration
activity runs are also charged, based on their number.
Logic Apps pricing plans depend on the resources you create and use. The
following articles can help you choose the right plan for specific use cases:
Costs that typically accrue with Azure Logic Apps
Single-tenant versus multi-tenant and integration service environment for Azure
Logic Apps
Usage metering, billing, and pricing models for Azure Logic Apps
For more information on pricing for specific components, see the following resources:
Use the Azure pricing calculator to add your selected component options and
estimate the overall solution cost.
Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands
placed on it by users in an efficient manner. For more information, see Performance
efficiency pillar overview.
Scalability
To scale Azure Functions automatically or manually, choose the right hosting plan.
For Azure Machine Learning custom models hosted as web services on AKS, the
azureml-fe front end automatically scales as needed. This component also routes
incoming inference requests to deployed services.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributor.
Principal author:
Next steps
Get started: Form Recognizer Studio
Use Form Recognizer SDKs or REST API
Quickstart: Get started with Language Studio
What is optical character recognition (OCR)?
How to configure Azure Functions with a virtual network
Related resources
Extract text from objects using Power Automate and AI Builder
Suggest content tags with NLP using deep learning
Knowledge mining for content research
Automate document processing by using Azure Form Recognizer
Index file content and metadata
by using Azure Cognitive Search
Azure Cognitive Search Azure Blob Storage Azure Table Storage
This article demonstrates how to create a search service that enables users to search for
documents based on document content in addition to any metadata that's associated
with the files.
You can implement this service by using multiple indexers in Azure Cognitive Search.
This article uses an example workload to demonstrate how to create a single search
index that's based on files in Azure Blob Storage. The file metadata is stored in Azure
Table Storage.
Architecture
Download a PowerPoint file of this architecture.
Dataflow
1. Files are stored in Blob Storage, possibly together with a limited amount of
metadata (for example, the document's author).
2. Additional metadata is stored in Table Storage, which can store significantly more
information for each document.
3. An indexer reads the contents of each file, together with any blob metadata, and
stores the data in the search index.
4. Another indexer reads the additional metadata from the table and stores it in the
same search index.
5. A search query is sent to the search service. The query returns matching
documents, based on both document content and document metadata.
Components
Blob Storage provides cost-effective cloud storage for file data, including data in
formats like PDF, HTML, and CSV, and in Microsoft Office files.
Table Storage provides storage for nonrelational structured data. In this scenario,
it's used to store the metadata for each document.
Azure Cognitive Search is a fully managed search service that provides
infrastructure, APIs, and tools for building a rich search experience.
Alternatives
This scenario uses indexers in Azure Cognitive Search to automatically discover new
content in supported data sources, like blob and table storage, and then add it to the
search index. Alternatively, you can use the APIs provided by Azure Cognitive Search to
push data to the search index. If you do, however, you need to write code to push the
data into the search index and also to parse and extract text from the binary documents
that you want to search. The Blob Storage indexer supports many document formats,
which significantly simplifies the text extraction and indexing process.
Also, if you use indexers, you can optionally enrich the data as part of an indexing
pipeline. For example, you can use Azure Cognitive Services to perform optical character
recognition (OCR) or visual analysis of the images in documents, detect the language of
documents, or translate documents. You can also define your own custom skills to
enrich the data in ways that are relevant to your business scenario.
This architecture uses blob and table storage because they're cost-effective and
efficient. This design also enables combined storage of the documents and metadata in
a single storage account. Alternative supported data sources for the documents
themselves include Azure Data Lake Storage and Azure Files. Document metadata can
be stored in any other supported data source that holds structured data, like Azure SQL
Database and Azure Cosmos DB.
Scenario details
Azure Cognitive Search is a fully managed search service that can create search indexes
that contain the information you want to allow users to search for.
Because the files that are searched in this scenario are binary documents, you can store
them in Blob Storage. If you do, you can use the built-in Blob Storage indexer in Azure
Cognitive Search to automatically extract text from the files and add their content to the
search index.
To overcome this storage limitation, you can place additional metadata in another data
source that has a supported indexer, like Table Storage. You can add the document type,
business impact, and other metadata values as separate columns in the table. If you
configure the built-in Table Storage indexer to target the same search index as the blob
indexer, the blob and table storage metadata is combined for each document in the
search index.
property stores the full URL of the file in Blob Storage, for example,
https://contoso.blob.core.windows.net/files/paper/Resilience in Azure.pdf . The
indexer performs Base64 encoding on the value to ensure that there are no invalid
characters in the document key. The result is a unique document key, like
aHR0cHM6...mUucGRm0 .
If you add the metadata_storage_path as a column in Table Storage, you know exactly
which blob the metadata in the other columns belongs to, so you can use any
PartitionKey and RowKey value in the table. For example, you could use the blob
container name as the PartitionKey and the Base64-encoded full URL of the blob as the
RowKey , ensuring that there are no invalid characters in these keys either.
You can then use a field mapping in the table indexer to map the
metadata_storage_path column (or another column) in Table Storage to the
metadata_storage_path document key field in the search index. If you apply the
base64Encode function on the field mapping, you end up with the same document key
( aHR0cHM6...mUucGRm0 in the earlier example), and the metadata from Table Storage is
added to the same document that was extracted from Blob Storage.
7 Note
The table indexer documentation states that you shouldn't define a field mapping
to an alternative unique string field in your table. That's because the indexer
concatenates the PartitionKey and RowKey as the document key, by default.
Because you're already relying on the document key as configured by the blob
indexer (which is the Base64-encoded full URL of the blob), creating a field
mapping to ensure that both indexers refer to the same document in the search
index is appropriate and supported for this scenario.
Alternatively, you can map the RowKey (which is set to the Base64-encoded full URL of
the blob) to the metadata_storage_path document key directly, without storing it
separately and Base64-encoding it as part of the field mapping. However, keeping the
unencoded URL in a separate column clarifies which blob it refers to and allows you to
choose any partition and row keys without affecting the search indexer.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that you can use to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Reliability
Reliability ensures that your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.
Azure Cognitive Search provides a high SLA for reads (querying) if you have at least
two replicas. It provides a high SLA for updates (updating the search indexes) if you have
at least three replicas. You should therefore provision at least two replicas if you want
your users to be able to search reliably, and three if actual changes to the index also
need to be high-availability operations.
Azure Storage always stores multiple copies of your data to help protect it against
planned and unplanned events. Azure Storage provides additional redundancy options
for replicating data across regions. These safeguards apply to data in blob and table
storage.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
Azure Cognitive Search provides robust security controls that help you implement
network security, authentication and authorization, data residency and protection, and
administrative controls that help you maintain security, privacy, and compliance.
Whenever possible, use Microsoft Entra authentication to provide access to the search
service itself, and connect your search service to other Azure resources (like blob and
table storage in this scenario) by using a managed identity.
You can connect from the search service to the storage account by using a private
endpoint. When you use a private endpoint, the indexers can use a private connection
without requiring the blob and table storage to be accessible publicly.
Cost optimization
Cost optimization is about reducing unnecessary expenses and improving operational
efficiencies. For more information, see Overview of the cost optimization pillar.
For information about the costs of running this scenario, see this preconfigured estimate
in the Azure pricing calculator . All the services described here are configured in this
estimate. The estimate is for a workload that has a total document size of 20 GB in Blob
Storage and 1 GB of metadata in Table Storage. Two search units are used to satisfy the
SLA for read purposes, as described in the reliability section of this article. To see how
the pricing would change for your particular use case, change the appropriate variables
to match your expected usage.
If you review the estimate, you can see that the cost of blob and table storage is
relatively low. Most of the cost is incurred by Azure Cognitive Search, because it
performs the actual indexing and compute for running search queries.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Other contributor:
Next steps
Get started with Azure Cognitive Search
Increase relevancy using semantic search in Azure Cognitive Search
Security filters for trimming results in Azure Cognitive Search
Tutorial: Index from multiple data sources using the .NET SDK
Related resources
Choose a search data store in Azure
Intelligent product search engine for e-commerce
Process free-form text for search
Analyze video content with
Computer Vision and Azure
Machine Learning
Azure Machine Learning Azure AI services Azure Logic Apps Azure Synapse Analytics
This article describes an architecture that you can use to replace the manual analysis of
video footage with an automated, and frequently more accurate, machine learning
process.
The FFmpeg and Jupyter Notebook logos are trademarks of their respective companies. No
endorsement is implied by the use of these marks.
Architecture
Workflow
1. A collection of video footage, in MP4 format, is uploaded to Azure Blob Storage.
Ideally, the videos go into a "raw" container.
2. A preconfigured pipeline in Azure Machine Learning recognizes that video files are
uploaded to the container and initiates an inference cluster to start separating the
video footage into frames.
3. FFmpeg, an open-source tool, breaks down the video and extracts frames. You can
configure how many frames per second are extracted, the quality of the extraction,
and the format of the image file. The format can be JPG or PNG.
4. The inference cluster sends the images to Azure Data Lake Storage.
5. A preconfigured logic app that monitors Data Lake Storage detects that new
images are being uploaded. It starts a workflow.
6. The logic app calls a pretrained custom vision model to identify objects, features,
or qualities in the images. Alternatively or additionally, it calls a computer vision
(optical character recognition) model to identify textual information in the images.
7. Results are received in JSON format. The logic app parses the results and creates
key-value pairs. You can store the results in Azure dedicated SQL pools that are
provisioned by Azure Synapse Analytics.
8. Power BI provides data visualization.
Components
Azure Blob Storage provides object storage for cloud-native workloads and
machine learning stores. In this architecture, it stores the uploaded video files.
Azure Machine Learning is an enterprise-grade machine learning service for the
end-to-end machine learning lifecycle.
Azure Data Lake Storage provides massively scalable, enhanced-security, cost-
effective cloud storage for high-performance analytics workloads.
Computer Vision is part of Azure Cognitive Services . It's used to retrieve
information about each image.
Custom Vision enables you to customize and embed state-of-the-art computer
vision image analysis for your specific domains.
Azure Logic Apps automates workflows by connecting apps and data across
environments. It provides a way to access and process data in real time.
Azure Synapse Analytics is a limitless analytics service that brings together data
integration, enterprise data warehousing, and big data analytics.
Dedicated SQL pool (formerly SQL DW) is a collection of analytics resources that
are provisioned when you use Azure Synapse SQL.
Power BI is a collection of software services, apps, and connectors that work
together to provide visualizations of your data.
Alternatives
Azure Video Indexer is a video analytics service that uses AI to extract actionable
insights from stored videos. You can use it without any expertise in machine
learning.
Azure Data Factory is a fully managed serverless data integration service that
helps you construct ETL and ELT processes.
Azure Functions is a serverless platform as a service (PaaS) that runs single-task
code without requiring new infrastructure.
Azure Cosmos DB is a fully managed NoSQL database for modern app
development.
Scenario details
Many industries record video footage to detect the presence or absence of a particular
object or entity or to classify objects or entities. Video monitoring and analyses are
traditionally performed manually. These processes are often monotonous and prone to
errors, particularly for tasks that are difficult for the human eye. You can automate these
processes by using AI and machine learning.
A video recording can be separated into individual frames so that various technologies
can analyze the images. One such technology is computer vision: the capability of a
computer to identify objects and entities on an image.
Agriculture. Monitor and analyze crops and soil conditions over time. By using
drones or UAVs, farmers can record video footage for analysis.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework, a
set of guiding tenets that you can use to improve the quality of a workload. For more
information, see Microsoft Azure Well-Architected Framework.
Reliability
Reliability ensures your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.
A reliable workload is one that's both resilient and available. Resiliency is the ability of
the system to recover from failures and continue to function. The goal of resiliency is to
return the application to a fully functioning state after a failure occurs. Availability is a
measure of whether your users can access your workload when they need to.
For the availability guarantees of the Azure services in this solution, see these resources:
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
Identity management
Protect your infrastructure
Application security
Data sovereignty and encryption
Security resources
Cost optimization
Cost optimization is about reducing unnecessary expenses and improving operational
efficiencies. For more information, see Overview of the cost optimization pillar.
Use the pay-as-you-go strategy for your architecture, and scale out as needed
rather than investing in large-scale resources at the start.
Consider opportunity costs in your architecture, and the balance between first-
mover advantage versus fast follow. Use the pricing calculator to estimate the
initial cost and operational costs.
Establish policies, budgets, and controls that set cost limits for your solution.
Operational excellence
Operational excellence covers the operations processes that deploy an application and
keep it running in production. For more information, see Overview of the operational
excellence pillar.
Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands
placed on it by users in an efficient manner. For more information, see Performance
efficiency pillar overview.
Appropriate use of scaling and the implementation of PaaS offerings that have built-in
scaling are the main ways to achieve performance efficiency.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Other contributors:
Next steps
Introduction to Azure Storage
What is Azure Machine Learning?
What is Azure Cognitive Services?
What is Azure Logic Apps?
What is Azure Synapse Analytics?
What is Power BI embedded analytics?
Business Process Accelerator
Related resources
Image classification with convolutional neural networks (CNNs)
Image classification on Azure
MLOps framework to upscale machine learning lifecycle
Image classification on Azure
Azure Blob Storage Azure Computer Vision Azure Cosmos DB Azure Event Grid Azure Functions
By using Azure services, such as the Computer Vision API and Azure Functions,
companies can eliminate the need to manage individual servers, while reducing costs
and utilizing the expertise that Microsoft has already developed with processing images
with Cognitive Services. This example scenario specifically addresses an image-
processing use case. If you have different AI needs, consider the full suite of Cognitive
Services.
Architecture
Workflow
This scenario covers the back-end components of a web or mobile application. Data
flows through the scenario as follows:
1. Adding new files (image uploads) in Blob storage triggers an event in Azure Event
Grid. The uploading process can be orchestrated via the web or a mobile
application. Alternatively, images can be uploaded separately to the Azure Blob
storage.
2. Event Grid sends a notification that triggers the Azure Functions.
3. Azure Functions calls the Azure Computer Vision API to analyze the newly
uploaded image. Computer Vision accesses the image via the blob URL that's
parsed by Azure Functions.
4. Azure Functions persists the Computer Vision API response in Azure Cosmos DB.
This response includes the results of the analysis, along with the image metadata.
5. The results can be consumed and reflected on the web or mobile front end. Note
that this approach retrieves the results of the classification but not the uploaded
image.
Components
Computer Vision API is part of the Cognitive Services suite and is used to
retrieve information about each image.
Azure Functions provides the back-end API for the web application. This
platform also provides event processing for uploaded images.
Azure Event Grid triggers an event when a new image is uploaded to blob
storage. The image is then processed with Azure functions.
Azure Blob Storage stores all of the image files that are uploaded into the web
application, as well any static files that the web application consumes.
Azure Cosmos DB stores metadata about each image that is uploaded, including
the results of the processing from Computer Vision API.
Alternatives
Custom Vision Service . The Computer Vision API returns a set of taxonomy-
based categories. If you need to process information that isn't returned by the
Computer Vision API, consider the Custom Vision Service, which lets you build
custom image classifiers.
Cognitive Search (formerly Azure Search). If your use case involves querying the
metadata to find images that meet specific criteria, consider using Cognitive
Search. Currently in preview, Cognitive search seamlessly integrates this
workflow.
Logic Apps . If you don't need to react in real-time on added files to a blob, you
might consider using Logic Apps. A logic app which can check if a file was added
might be start by the recurrence trigger or sliding windows trigger.
Scenario details
This scenario is relevant for businesses that need to process images.
Potential applications include classifying images for a fashion website, analyzing text
and images for insurance claims, or understanding telemetry data from game
screenshots. Traditionally, companies would need to develop expertise in machine
learning models, train the models, and finally run the images through their custom
process to get the data out of the images.
Classifying images for insurance claims. Image classification can help reduce the
time and cost of claims processing and underwriting. It could help analyze natural-
disaster damage, vehicle-damage, and identify residential and commercial
properties.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Scalability
The majority of the components used in this example scenario are managed services
that will automatically scale. A couple of notable exceptions: Azure Functions has a limit
of a maximum of 200 instances. If you need to scale beyond this limit, consider multiple
regions or app plans.
You can provision Azure Cosmos DB to autoscale in Azure Cosmos DB for NoSQL only. If
you plan to use other APIs, see guidance on estimating your requirements in Request
units. To fully take advantage of the scaling in Azure Cosmos DB, understand how
partition keys work in Azure Cosmos DB.
NoSQL databases frequently trade consistency (in the sense of the CAP theorem) for
availability, scalability, and partitioning. In this example scenario, a key-value data model
is used and transaction consistency is rarely needed as most operations are by definition
atomic. Additional guidance to Choose the right data store is available in the Azure
Architecture Center. If your implementation requires high consistency, you can choose
your consistency level in Azure Cosmos DB.
For general guidance on designing scalable solutions, see the performance efficiency
checklist in the Azure Architecture Center.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
Managed identities for Azure resources are used to provide access to other resources
internal to your account and then assigned to your Azure Functions. Only allow access
to the requisite resources in those identities to ensure that nothing extra is exposed to
your functions (and potentially to your customers).
For general guidance on designing secure solutions, see the Azure Security
Documentation.
Resiliency
All of the components in this scenario are managed, so at a regional level they are all
resilient automatically.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
To explore the cost of running this scenario, all of the services are pre-configured in the
cost calculator. To see how the pricing would change for your particular use case,
change the appropriate variables to match your expected traffic.
We have provided three sample cost profiles based on amount of traffic (we assume all
images are 100 kb in size):
Small : this pricing example correlates to processing < 5000 images a month.
Medium : this pricing example correlates to processing 500,000 images a month.
Large : this pricing example correlates to processing 50 million images a month.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Next steps
Product documentation
Related resources
AI enrichment with image and natural language processing in Azure Cognitive Search
Use a speech-to-text transcription
pipeline to analyze recorded
conversations
Azure AI Speech Azure AI Language Azure AI services Azure Synapse Analytics Azure Logic Apps
Speech recognition and analysis of recorded customer calls can provide your business
with valuable information about current trends, product shortcomings, and successes.
The example solution described in this article outlines a repeatable pipeline for
transcribing and analyzing conversation data.
Architecture
The architecture consists of two pipelines: A transcription pipeline to convert audio to
text, and an enrichment and visualization pipeline.
Transcription pipeline
Dataflow
1. Audio files are uploaded to an Azure Storage account via any supported method.
You can use a UI-based tool like Azure Storage Explorer or use a storage SDK or
API.
2. The upload to Azure Storage triggers an Azure logic app. The logic app accesses
any necessary credentials in Azure Key Vault and makes a request to the Speech
service's batch transcription API.
3. The logic app submits the audio files call to the Speech service, including optional
settings for speaker diarization.
4. The Speech service completes the batch transcription and loads the transcription
results to the Storage account.
Dataflow
5. An Azure Synapse Analytics pipeline runs to retrieve and process the transcribed
audio text.
6. The pipeline sends processed text via an API call to the Language service. The
service performs various natural language processing (NLP) enrichments, like
sentiment and opinion mining, summarization, and custom and pre-built named
entity recognition.
7. The processed data is stored in an Azure Synapse Analytics SQL pool, where it can
be served to visualization tools like Power BI.
Components
Azure Blob Storage. Massively scalable and secure object storage for cloud-
native workloads, archives, data lakes, high-performance computing, and machine
learning. In this solution, it stores the audio files and transcription results and
serves as a data lake for downstream analytics.
Azure Logic Apps. An integration platform as a service (iPaaS) that's built on a
containerized runtime. In this solution, it integrates storage and speech AI services.
Azure Cognitive Services Speech service. An AI-based API that provides speech
capabilities like speech-to-text, text-to-speech, speech translation, and speaker
recognition. Its batch transcription functionality is used in this solution.
Azure Cognitive Service for Language. An AI-based managed service that
provides natural language capabilities like sentiment analysis, entity extraction, and
automated question answering.
Azure Synapse Analytics. A suite of services that provide data integration,
enterprise data warehousing, and big data analytics. In this solution, it transforms
and enriches transcription data and serves data to downstream visualization tools.
Power BI. A data modeling and visual analytics tool. In this solution, it presents
transcribed audio insights to users and decision makers.
Alternatives
Here are some alternative approaches to this solution architecture:
Scenario details
Customer care centers are an integral part of the success of many businesses in many
industries. This solution uses the Speech API from Azure Cognitive Services for the audio
transcription and diarization of recorded customer calls. Azure Synapse Analytics is used
to process and perform NLP tasks like sentiment analysis and custom named entity
recognition through API calls to Azure Cognitive Service for Language.
You can use the services and pipeline described here to process transcribed text to
recognize and remove sensitive information, perform sentiment analysis, and more. You
can scale the services and pipeline to accommodate any volume of recorded data.
Potential use cases
This solution can provide value to organizations in many industries, including
telecommunications, financial services, and government. It applies to any organization
that records conversations. In particular, customer-facing or internal call centers or
support desks can benefit from the insights derived from this solution.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that you can use to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
The request to the Speech API can include a shared access signature (SAS) URI for
a destination container in Azure Storage. A SAS URI enables the Speech service to
directly output the transcription files to the container location. If your organization
doesn't allow the use of SAS URIs for storage, you need to implement a function to
periodically poll the Speech API for completed assets.
Credentials like account or API keys should be stored in Azure Key Vault as secrets.
Configure your Logic Apps and Azure Synapse pipelines to access the key vault by
using managed identities to avoid storing secrets in application settings or code.
The audio files that are stored in the blob might contain sensitive customer data. If
multiple clients are using the solution, you need to restrict access to these files.
Use hierarchical namespace on the storage account and enforce folder and file
level permissions to limit access to only the needed Microsoft Entra instance.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
All Azure services described in this architecture provide an option for pay-as-you-go
billing, so solution costs scale linearly.
Azure Synapse provides an option for serverless SQL pools, so the compute for the data
warehousing workload can be spun up on demand. If you aren't using Azure Synapse to
serve other downstream use cases, consider using serverless to reduce costs.
See Overview of the cost optimization pillar for more cost optimization strategies.
For pricing for the services suggested here, see this estimate in the Azure pricing
calculator .
Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands
placed on it by users in an efficient manner. For more information, see Performance
efficiency pillar overview.
The batch speech API is designed for high volume, but other Cognitive Services APIs
might have request limits for each subscription tier. Consider containerizing these APIs
to avoid throttling large-volume processing. Containers give you flexibility in
deployment, in the cloud or on-premises. You can also mitigate side effects of new
version rollouts by using containers. For more information, see Container support in
Azure Cognitive Services.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Other contributor:
Next steps
Quickstart: Recognize and convert speech to text
Quickstart: Create an integration workflow with multi-tenant Azure Logic Apps and
the Azure portal
Quickstart: Get started with Language Studio
Cognitive Services in Azure Synapse Analytics
What is the Speech service?
What is Azure Logic Apps?
What is Azure Cognitive Service for Language?
What is Azure Synapse Analytics?
Extract insights from text with the Language service
Model, query, and explore data in Azure Synapse
Related resources
Natural language processing technology
Optimize marketing with machine learning
Big data analytics with enterprise-grade security using Azure Synapse
Extract and analyze call center
data
Azure Blob Storage Azure AI Speech Azure AI services Power BI
This article describes how to extract insights from customer conversations at a call
center by using Azure AI services and Azure OpenAI Service. Use these real-time and
post-call analytics to improve call center efficiency and customer satisfaction.
Architecture
Detailed call
history, summaries,
reasons for calling
Download a PowerPoint file of this architecture.
Dataflow
1. A phone call between an agent and a customer is recorded and stored in Azure
Blob Storage. Audio files are uploaded to an Azure Storage account via a
supported method, such as the UI-based tool, Azure Storage Explorer, or a Storage
SDK or API.
For batch mode transcription and personal data detection and redaction, use the
AI services Ingestion Client tool. The Ingestion Client tool uses a no-code approach
for call center transcription.
4. Azure OpenAI is used to process the transcript and extract entities, summarize the
conversation, and analyze sentiments. The processed output is stored in Blob
Storage and then analyzed and visualized by using other services. You can also
store the output in a datastore for keeping track of metadata and for reporting.
Use Azure OpenAI to process the stored transcription information.
Components
Blob Storage is the object storage solution for raw files in this scenario. Blob
Storage supports libraries for languages like .NET, Node.js, and Python.
Applications can access files on Blob Storage via HTTP or HTTPS. Blob Storage has
hot, cool, and archive access tiers for storing large amounts of data, which
optimizes cost.
Azure OpenAI provides access to the Azure OpenAI language models, including
GPT-3, Codex, and the embeddings model series, for content generation,
summarization, semantic search, and natural language-to-code translation. You
can access the service through REST APIs, Python SDK, or the web-based interface
in the Azure OpenAI Studio .
Azure AI Speech is an AI-based API that provides speech capabilities like speech-
to-text, text-to-speech, speech translation, and speaker recognition. This
architecture uses the Azure AI Speech batch transcription functionality.
Alternatives
Depending on your scenario, you can add the following workflows.
Scenario details
This solution uses Azure AI Speech to convert audio into written text. Azure AI Language
redacts sensitive information in the conversation transcription. Azure OpenAI extracts
insights from customer conversation to improve call center efficiency and customer
satisfaction. Use this solution to process transcribed text, recognize and remove
sensitive information, and perform sentiment analysis. Scale the services and the
pipeline to accommodate any volume of recorded data.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Reliability
Reliability ensures your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.
Find the availability service level agreement (SLA) for each component in SLAs for
online services .
To design high-availability applications with Storage accounts, see the
configuration options.
To ensure resiliency of the compute services and datastores in this scenario, use
failure mode for services like Azure Functions and Storage. For more information,
see the resiliency checklist for Azure services.
Back up and recover your Form Recognizer models.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
Implement data protection, identity and access management, and network security
recommendations for Blob Storage, AI services, and Azure OpenAI.
Configure AI services virtual networks.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
The total cost of this solution depends on the pricing tier of your services. Factors that
can affect the price of each component are:
Performance efficiency
Performance efficiency is the ability of your workload to meet the demands placed on it
by users in an efficient manner. For more information, see Overview of the performance
efficiency pillar.
When high volumes of data are processed, it can expose performance bottlenecks. To
ensure proper performance efficiency, understand and plan for the scaling options to
use with the AI services autoscale feature.
The batch speech API is designed for high volumes, but other AI services APIs might
have request limits, depending on the subscription tier. Consider containerizing AI
services APIs to avoid slowing down large-volume processing. Containers provide
deployment flexibility in the cloud and on-premises. Mitigate side effects of new version
rollouts by using containers. For more information, see Container support in AI services.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Next steps
What is Azure AI Speech?
What is Azure OpenAI?
What is Azure Machine Learning?
Introduction to Blob Storage
What is Azure AI Language?
Introduction to Azure Data Lake Storage Gen2
What is Power BI?
Ingestion Client with AI services
Post-call transcription and analytics
Related resources
Use a speech-to-text transcription pipeline to analyze recorded conversations
Deploy a custom speech-to-text solution
Create custom language and acoustic models
Deploy a custom speech-to-text solution
Determine customer lifetime and
churn with Azure AI services
Azure Data Lake Storage Azure Databricks Azure Machine Learning Azure Analysis Services
This scenario shows a solution for creating predictive models of customer lifetime
value and churn rate by using Azure AI technologies.
Architecture
Azure Data Factory Azure Databricks Azure Machine Learning Azure Analysis Services
Event Grid Subscriptions MLflow MLflow BI Platforms
Azure Kubernetes Service
Model training
Machine Serving phase
learning registry
Azure
intelligent
analytics and
AI platforms
Power BI
Feature engineering dashboard
Data Factory Machine
learning
deployment
layer
Data processing
Dashboard
Azure storage
technologies
Azure Data Lake Storage Azure SQL Database Azure Analysis Services
Microsoft
Azure
Dataflow
1. Ingestion and orchestration: Ingest historical, transactional, and third-party data
for the customer from on-premises data sources. Use Azure Data Factory and store
the results in Azure Data Lake Storage.
2. Data processing: Use Azure Databricks to pick up and clean the raw data from the
Data Lake Storage. Store the data in the silver layer in Azure Data Lake Storage.
3. Feature engineering: With Azure Databricks, load data from the silver layer of Data
Lake Storage. Use PySpark to enrich the data. After preparation, use feature
engineering to provide a better representation of data. Feature engineering can
also improve the performance of the machine learning algorithm.
4. Model training: In model training, the silver tier data is the model training dataset.
You can use MLflow to manage machine learning experiments. MLflow keeps track
of all metrics you need to evaluate your machine learning experiment.
5. Machine learning registry: An Azure Data Factory pipeline registers the best
machine learning model in the Azure Machine Learning Service according to the
metrics chosen. The machine learning model is deployed by using the Azure
Kubernetes Service .
6. Serving phase: In the serving phase, you can use reporting tools to work with your
model predictions. These tools include Power BI and Azure Analyses Services.
Components
Azure Analysis Services provides enterprise-grade data models in the cloud.
Azure Data Factory provides a data integration and transformation layer that
works across your digital transformation initiatives.
Azure Databricks is a data analytics platform optimized for the Microsoft Azure
cloud services platform.
Alternatives
Data Factory orchestrates the workflows for your data pipeline. If you want to load
data only one time or on demand, use tools like SQL Server bulk copy and AzCopy
to copy data into Azure Blob Storage . You can then load the data directly into
Azure Synapse Analytics using PolyBase.
Some business intelligence tools may not support Azure Analysis Services. The
curated data can instead be accessed directly from Azure SQL Database. Data is
stored using Azure Data Lake Storage and accessed using Azure Databricks
storage for data processing.
Scenario details
Customer lifetime value measures the net profit from a customer. This metric includes
profit from the customer's whole relationship with your company. Churn or churn rate
measures the number of individuals or items moving out of a group over a period.
This retail customer scenario classifies your customers based on marketing and
economic measures. This scenario also creates a customer segmentation based on
several metrics. It trains a multi-class classifier on new data. The resulting model scores
batches of new customer orders through a regularly scheduled Azure Databricks
notebook job.
Use Azure Data Lake and Azure Databricks to implement best practices for data
operations.
Use Azure Databricks to do exploratory data analysis.
Use MLflow to track machine learning experiments.
Batch score machine learning models on Azure Databricks.
Use Azure Machine Learning to model registration and deployment.
Use Azure Data Factory and Azure Databricks notebooks to orchestrate the MLOps
pipeline.
Potential use cases
This solution is ideal for the retail industry. It's helpful in the following use cases:
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Availability
Azure components offer availability through redundancy and as specified in service-level
agreements (SLAs):
For information about Data Factory pipelines, see SLA for Data Factory .
For information about Azure Databricks, see Azure Databricks .
Data Lake Storage offers availability through redundancy. See Azure Storage
redundancy.
Scalability
This scenario uses Azure Data Lake Storage to store data for machine learning models
and predictions. Azure Storage is scalable. It can store and serve many exabytes of data.
This amount of storage is available with throughput measured in gigabits per second
(Gbps). Processing runs at near-constant per-request latencies. Latencies are measured
at the service, account, and file levels.
This scenario uses Azure Databricks clusters, which enable autoscaling by default.
Autoscaling enables Databricks during runtime to dynamically reallocate resources. With
autoscaling, you don't need to start a cluster to match a workload, which makes it easier
to achieve high cluster usage.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
Protect assets by using controls on network traffic originating in Azure, between on-
premises and Azure hosted resources, and traffic to and from Azure. For instance, Azure
self-hosted integration runtime securely moves data from on-premises data storage to
Azure.
Use Azure Key Vault and Databricks scoped secret to access data in Azure Data Lake
Storage.
Azure services are either deployed in a secure virtual network or accessed using the
Azure Private Link feature. If necessary, row-level security provides granular access to
individual users in Azure Analysis Services or SQL Database.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
There are standard and premium Databricks pricing tiers. For this scenario, the standard
pricing tier is sufficient. If your application requires automatically scaling clusters to
handle larger workloads or interactive Databricks dashboards, you might need the
premium tier.
Costs related to this use case depend on the standard pricing for the following services
for your usage:
To estimate the cost of Azure products and configurations, visit the Azure pricing
calculator .
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Azure Machine Learning
Introduction to Azure Data Lake Storage Gen2
Azure Databricks
Azure Data Factory
Related resources
Artificial intelligence
MLOps for Python models using Azure Machine Learning
Customer churn prediction using real-time analytics
Predict Length of Stay and Patient Flow
Batch scoring for deep learning
models using Azure Machine
Learning pipelines
Azure Logic Apps Azure Machine Learning Azure Role-based access control Azure Storage
This reference architecture shows how to apply neural-style transfer to a video, using
Azure Machine Learning. Style transfer is a deep learning technique that composes an
existing image in the style of another image. You can generalize this architecture for any
scenario that uses batch scoring with deep learning.
Architecture
Workflow
This architecture consists of the following components.
Compute
Azure Machine Learning uses pipelines to create reproducible and easy-to-manage
sequences of computation. It also offers a managed compute target (on which a
pipeline computation can run) called Azure Machine Learning Compute for training,
deploying, and scoring machine learning models.
Storage
Azure Blob Storage stores all the images (input images, style images, and output
images). Azure Machine Learning integrates with Blob Storage so that users don't have
to manually move data across compute platforms and blob storages. Blob Storage is
also cost-effective for the performance that this workload requires.
Trigger
Azure Logic Apps triggers the workflow. When the Logic App detects that a blob has
been added to the container, it triggers the Azure Machine Learning pipeline. Logic
Apps is a good fit for this reference architecture because it's an easy way to detect
changes to blob storage, with an easy process for changing the trigger.
1. Use FFmpeg to extract the audio file from the video footage, so that the audio
file can be stitched back into the output video later.
2. Use FFmpeg to break the video into individual frames. The frames are processed
independently, in parallel.
3. At this point, you can apply neural style transfer to each individual frame in
parallel.
4. After each frame has been processed, use FFmpeg to restitch the frames back
together.
5. Finally, reattach the audio file to the restitched footage.
Components
Azure Machine Learning
Azure Blob Storage
Azure Logic Apps
Solution details
This reference architecture is designed for workloads that are triggered by the presence
of new media in Azure storage.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands
placed on it by users in an efficient manner. For more information, see Performance
efficiency pillar overview.
GPUs aren't enabled by default in all regions. Make sure to select a region with GPUs
enabled. In addition, subscriptions have a default quota of zero cores for GPU-optimized
VMs. You can raise this quota by opening a support request. Make sure that your
subscription has enough quota to run your workload.
When you run a style transfer process as a batch job, the jobs that run primarily on
GPUs need to be parallelized across VMs. Two approaches are possible: You can create a
larger cluster using VMs that have a single GPU, or create a smaller cluster using VMs
with many GPUs.
For this workload, these two options have comparable performance. Using fewer VMs
with more GPUs per VM can help to reduce data movement. However, the data volume
per job for this workload isn't large, so you won't observe much throttling by Blob
Storage.
MPI step
When creating the Azure Machine Learning pipeline, one of the steps used to perform
parallel computation is the (message processing interface) MPI step. The MPI step helps
split the data evenly across the available nodes. The MPI step doesn't execute until all
the requested nodes are ready. Should one node fail or get preempted (if it's a low-
priority virtual machine), the MPI step will have to be rerun.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar. This section
contains considerations for building secure solutions.
Use Azure role-based access control (Azure RBAC) to limit users' access to only the
resources they need.
Provision two separate storage accounts. Store input and output data in the first
account. External users can be given access to this account. Store executable
scripts and output log files in the other account. External users should not have
access to this account. This separation ensures that external users can't modify any
executable files (to inject malicious code), and don't have access to log files, which
could hold sensitive information.
Malicious users can perform a DDoS attack on the job queue or inject
malformed poison messages in the job queue, causing the system to lock up or
causing dequeuing errors.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
Compared to the storage and scheduling components, the compute resources used in
this reference architecture by far dominate in terms of costs. One of the main challenges
is effectively parallelizing the work across a cluster of GPU-enabled machines.
The Azure Machine Learning Compute cluster size can automatically scale up and down
depending on the jobs in the queue. You can enable autoscale programmatically by
setting the minimum and maximum nodes.
For work that doesn't require immediate processing, configure autoscale so the default
state (minimum) is a cluster of zero nodes. With this configuration, the cluster starts with
zero nodes and only scales up when it detects jobs in the queue. If the batch scoring
process happens only a few times a day or less, this setting results in significant cost
savings.
Autoscaling may not be appropriate for batch jobs that happen too close to each other.
The time that it takes for a cluster to spin up and spin down also incur a cost, so if a
batch workload begins only a few minutes after the previous job ends, it might be more
cost effective to keep the cluster running between jobs.
Azure Machine Learning Compute also supports low-priority virtual machines, which
allows you to run your computation on discounted virtual machines, with the caveat that
they may be preempted at any time. Low-priority virtual machines are ideal for non-
critical batch scoring workloads.
To check the overall state of the cluster, go to the Machine Learning service in the Azure
portal to check the state of the nodes in the cluster. If a node is inactive or a job has
failed, the error logs are saved to Blob Storage, and are also accessible in the Azure
portal.
You can also deploy a batch scoring architecture for deep learning models by using the
Azure Kubernetes Service. Follow the steps described in this GitHub repo .
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Batch scoring of Spark models on Azure Databricks
Batch scoring of Python models on Azure
Batch scoring with R models to forecast sales
Related resources
Artificial intelligence architecture
What is Azure Machine Learning?
Azure Machine Learning pipelines
Batch scoring of Python models
on Azure
Azure Container Registry Azure Event Hubs Azure Machine Learning Azure SQL Database
This architecture guide shows how to build a scalable solution for batch scoring models
Azure Machine Learning. The solution can be used as a template and can generalize to
different problems.
Architecture
Workflow
This architecture guide is applicable for both streaming and static data, provided that
the ingestion process is adapted to the data type. The following steps and components
describe the ingestion of these two types of data.
Streaming data:
1. Streaming data originates from IoT Sensors, where new events are streamed at
frequent intervals.
2. Incoming streaming events are queued using Azure Event Hubs, and then pre-
processed using Azure Stream Analytics.
Azure Event Hubs. This message ingestion service can ingest millions of event
messages per second. In this architecture, sensors send a stream of data to
the event hub.
Azure Stream Analytics. An event-processing engine. A Stream Analytics job
reads the data streams from the event hub and performs stream processing.
Static data:
3. Static datasets can be stored as files within Azure Data Lake Storage or in tabular
form in Azure Synapse or Azure SQL Database .
4. Azure Data Factory can be used to aggregate or pre-process the stored dataset.
The remaining architecture, after data ingestion, is equal for both streaming and static
data, and consists of the following steps and components:
Components
Azure Event Hubs
Azure Stream Analytics
Azure SQL Database
Azure Synapse Analytics
Azure Data Lake Storage
Azure Data Factory
Azure Machine Learning
Azure Machine Learning Endpoints
Microsoft Power BI on Azure
Azure Web Apps
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Performance
For standard Python models, it's generally accepted that CPUs are sufficient to handle
the workload. This architecture uses CPUs. However, for deep learning workloads, GPUs
generally outperform CPUs by a considerable amount; a sizeable cluster of CPUs is
usually needed to get comparable performance.
For convenience in this scenario, one scoring task is submitted within a single Azure
Machine Learning pipeline step. However, it can be more efficient to score multiple data
chunks within the same pipeline step. In those cases, write custom code to read in
multiple datasets and execute the scoring script during a single-step execution.
Management
Monitor jobs. It's important to monitor the progress of running jobs. However, it
can be a challenge to monitor across a cluster of active nodes. To inspect the state
of the nodes in the cluster, use the Azure portal to manage the Machine
Learning workspace. If a node is inactive or a job has failed, the error logs are
saved to blob storage, and are also accessible in the Pipelines section. For richer
monitoring, connect logs to Application Insights, or run separate processes to poll
for the state of the cluster and its jobs.
Logging. Machine Learning logs all stdout/stderr to the associated Azure Storage
account. To easily view the log files, use a storage navigation tool such as Azure
Storage Explorer .
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
The most expensive components used in this architecture guide are the compute
resources. The compute cluster size scales up and down depending on the jobs in the
queue. Enable automatic scaling programmatically through the Python SDK by
modifying the compute's provisioning configuration. Or, use the Azure CLI to set the
automatic scaling parameters of the cluster.
For work that doesn't require immediate processing, configure the automatic scaling
formula so the default state (minimum) is a cluster of zero nodes. With this
configuration, the cluster starts with zero nodes and only scales up when it detects jobs
in the queue. If the batch scoring process happens only a few times a day or less, this
setting enables significant cost savings.
Automatic scaling might not be appropriate for batch jobs that occur too close to each
other. Because the time that it takes for a cluster to spin up and spin down incurs a cost,
if a batch workload begins only a few minutes after the previous job ends, it might be
more cost effective to keep the cluster running between jobs. This strategy depends on
whether scoring processes are scheduled to run at a high frequency (every hour, for
example), or less frequently (once a month, for example).
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Next steps
Product documentation:
Related resources
Artificial intelligence (AI) - Architectural overview
Batch scoring for deep learning models using Azure Machine Learning pipelines
Batch scoring of Spark models on Azure Databricks
MLOps for Python models using Azure Machine Learning
Batch scoring with R models to
forecast sales
Azure Batch Azure Blob Storage Azure Container Instances Azure Logic Apps Azure Machine Learning
This reference architecture shows how to perform batch scoring with R models using
Azure Batch. Azure Batch works well with intrinsically parallel workloads and includes job
scheduling and compute management. Batch inference (scoring) is widely used to
segment customers, forecast sales, predict customer behaviors, predict maintenance, or
improve cyber security.
Workflow
This architecture consists of the following components.
Azure Batch runs forecast generation jobs in parallel on a cluster of virtual machines.
Predictions are made using pre-trained machine learning models implemented in R.
Azure Batch can automatically scale the number of VMs based on the number of jobs
submitted to the cluster. On each node, an R script runs within a Docker container to
score data and generate forecasts.
Azure Blob Storage stores the input data, the pre-trained machine learning models, and
the forecast results. It delivers cost-effective storage for the performance that this
workload requires.
Azure Container Instances provides serverless compute on demand. In this case, a
container instance is deployed on a schedule to trigger the Batch jobs that generate the
forecasts. The Batch jobs are triggered from an R script using the doAzureParallel
package. The container instance automatically shuts down once the jobs have finished.
Azure Logic Apps triggers the entire workflow by deploying the container instances on a
schedule. An Azure Container Instances connector in Logic Apps allows an instance to
be deployed upon a range of trigger events.
Components
Azure Batch
Azure Blob Storage
Azure Container Instances
Azure Logic Apps
Solution details
Although the following scenario is based on retail store sales forecasting, its architecture
can be generalized for any scenario requiring the generation of predictions on a larger
scale using R models. A reference implementation for this architecture is available on
GitHub .
1. An Azure Logic App triggers the forecast generation process once per week.
2. The logic app starts an Azure Container Instance running the scheduler Docker
container, which triggers the scoring jobs on the Batch cluster.
3. Scoring jobs run in parallel across the nodes of the Batch cluster. Each node:
The following figure shows the forecasted sales for four products (SKUs) in one store.
The black line is the sales history, the dashed line is the median (q50) forecast, the pink
band represents the 25th and 75th percentiles, and the blue band represents the 50th
and 95th percentiles.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Performance
Containerized deployment
With this architecture, all R scripts run within Docker containers. Using containers
ensures that the scripts run in a consistent environment every time, with the same R
version and packages versions. Separate Docker images are used for the scheduler and
worker containers, because each has a different set of R package dependencies.
Each node of the Batch cluster runs the worker container, which executes the scoring
script.
How much data can be loaded and processed in the memory of a single node.
The overhead of starting each batch job.
The overhead of loading the R models.
In the scenario used for this example, the model objects are large, and it takes only a
few seconds to generate a forecast for individual products. For this reason, you can
group the products and execute a single Batch job per node. A loop within each job
generates forecasts for the products sequentially. This method is the most efficient way
to parallelize this particular workload. It avoids the overhead of starting many smaller
Batch jobs and repeatedly loading the R models.
An alternative approach is to trigger one Batch job per product. Azure Batch
automatically forms a queue of jobs and submits them to be executed on the cluster as
nodes become available. Use automatic scaling to adjust the number of nodes in the
cluster, depending on the number of jobs. This approach is useful if it takes a relatively
long time to complete each scoring operation, which justifies the overhead of starting
the jobs and reloading the model objects. This approach is also simpler to implement
and gives you the flexibility to use automatic scaling, an important consideration if the
size of the total workload isn't known in advance.
Monitor Azure Batch jobs
Monitor and terminate Batch jobs from the Jobs pane of the Batch account in the Azure
portal. Monitor the batch cluster, including the state of individual nodes, from the Pools
pane.
To quickly debug Batch jobs during development, view the logs in your local R session.
For more information, see using the Configure and submit training runs.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
The compute resources used in this reference architecture are the most costly
components. For this scenario, a cluster of fixed size is created whenever the job is
triggered and then shut down after the job has completed. Cost is incurred only while
the cluster nodes are starting, running, or shutting down. This approach is suitable for a
scenario where the compute resources required to generate the forecasts remain
relatively constant from job to job.
In scenarios where the amount of compute required to complete the job isn't known in
advance, it may be more suitable to use automatic scaling. With this approach, the size
of the cluster is scaled up or down depending on the size of the job. Azure Batch
supports a range of autoscale formulae, which you can set when defining the cluster
using the doAzureParallel API.
For some scenarios, the time between jobs may be too short to shut down and start up
the cluster. In these cases, keep the cluster running between jobs if appropriate.
Azure Batch and doAzureParallel support the use of low-priority VMs. These VMs come
with a significant discount but risk being appropriated by other higher priority
workloads. Therefore, the use of low-priority VMs isn't recommended for critical
production workloads. However, they're useful for experimental or development
workloads.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
What is Azure Machine Learning?
Azure Machine Learning pipelines
Related resources
Artificial intelligence architecture
Batch scoring of Spark models on Azure Databricks
Batch scoring of Python models on Azure
Batch scoring for deep learning models
Batch scoring of Spark models on
Azure Databricks
Azure Active Directory Azure Databricks Azure Data Factory Azure Blob Storage
This reference architecture shows how to build a scalable solution for batch scoring an
Apache Spark classification model on a schedule using Azure Databricks. Azure
Databricks is an Apache Spark-based analytics platform optimized for Azure. Azure
Databricks offers three environments for developing data intensive applications:
Databricks SQL, Databricks Data Science & Engineering, and Databricks Machine
Learning. Databricks Machine Learning is an integrated end-to-end machine learning
environment incorporating managed services for experiment tracking, model training,
feature development and management, and feature and model serving. You can use this
reference architecture as a template that can be generalized to other scenarios. A
reference implementation for this architecture is available on GitHub .
Apache® and Apache Spark® are either registered trademarks or trademarks of the
Apache Software Foundation in the United States and/or other countries. No endorsement
by The Apache Software Foundation is implied by the use of these marks.
Architecture
Azure Databricks
Databricks store
Scoring data
Batch jobs scheduler
Results data
Microsoft
Azure
Workflow
The architecture defines a data flow that is entirely contained within Azure Databricks
based on a set of sequentially executed notebooks . It consists of the following
components:
Data files. The reference implementation uses a simulated data set contained in five
static data files.
Ingestion. The data ingestion notebook downloads the input data files into a collection
of Databricks data sets. In a real-world scenario, data from IoT devices would stream
onto Databricks-accessible storage such as Azure SQL or Azure Blob storage. Databricks
supports multiple data sources .
Training pipeline. This notebook executes the feature engineering notebook to create
an analysis data set from the ingested data. It then executes a model building notebook
that trains the machine learning model using the Apache Spark MLlib scalable
machine learning library.
Scoring pipeline. This notebook executes the feature engineering notebook to create
scoring data set from the ingested data and executes the scoring notebook. The scoring
notebook uses the trained Spark MLlib model to generate predictions for the
observations in the scoring data set. The predictions are stored in the results store, a
new data set on the Databricks data store.
Scheduler. A scheduled Databricks job handles batch scoring with the Spark model.
The job executes the scoring pipeline notebook, passing variable arguments through
notebook parameters to specify the details for constructing the scoring data set and
where to store the results data set.
Solution details
The scenario is constructed as a pipeline flow. Each notebook is optimized to perform in
a batch setting for each of the operations: ingestion, feature engineering, model
building, and model scorings. The feature engineering notebook is designed to
generate a general data set for any of the training, calibration, testing, or scoring
operations. In this scenario, we use a temporal split strategy for these operations, so the
notebook parameters are used to set date-range filtering.
Because the scenario creates a batch pipeline, we provide a set of optional examination
notebooks to explore the output of the pipeline notebooks. You can find these
notebooks in the GitHub repository notebooks folder :
1a_raw-data_exploring.ipynb
2a_feature_exploration.ipynb
2b_model_testing.ipynb
3b_model_scoring_evaluation.ipynb
A predictive maintenance model collects data from the machines and retains historical
examples of component failures. The model can then be used to monitor the current
state of the components and predict if a given component will fail soon. For common
use cases and modeling approaches, see Azure AI guide for predictive maintenance
solutions.
This reference architecture is designed for workloads that are triggered by the presence
of new data from the component machines. Processing involves the following steps:
1. Ingest the data from the external data store onto an Azure Databricks data store.
2. Train a machine learning model by transforming the data into a training data set,
then building a Spark MLlib model. MLlib consists of most common machine
learning algorithms and utilities optimized to take advantage of Spark data
scalability capabilities.
Recommendations
Databricks is set up so you can load and deploy your trained models to make
predictions with new data. Databricks also provides other advantages:
To interact with the Azure Databricks service, use the Databricks Workspace interface
in a web browser or the command-line interface (CLI). Access the Databricks CLI from
any platform that supports Python 2.7.9 to 3.6.
Monitor job execution through the Databricks user interface, the data store, or the
Databricks CLI as necessary. Monitor the cluster using the event log and other
metrics that Databricks provides.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Performance
An Azure Databricks cluster enables autoscaling by default so that during runtime,
Databricks dynamically reallocates workers to account for the characteristics of your job.
Certain parts of your pipeline may be more computationally demanding than others.
Databricks adds extra workers during these phases of your job (and removes them when
they're no longer needed). Autoscaling makes it easier to achieve high cluster utilization,
because you don't need to provision the cluster to match a workload.
Develop more complex scheduled pipelines by using Azure Data Factory with Azure
Databricks.
Storage
In this reference implementation, the data is stored directly within Databricks storage for
simplicity. In a production setting, however, you can store the data on cloud data
storage such as Azure Blob Storage . Databricks also supports Azure Data Lake
Store , Azure Synapse Analytics , Azure Cosmos DB , Apache Kafka , and Apache
Hadoop .
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
In general, use the Azure pricing calculator to estimate costs. Other considerations are
described in the Cost section in Microsoft Azure Well-Architected Framework.
Azure Databricks is a premium Spark offering with an associated cost. In addition, there
are standard and premium Databricks pricing tiers .
For this scenario, the standard pricing tier is sufficient. However, if your specific
application requires automatically scaling clusters to handle larger workloads or
interactive Databricks dashboards, the premium level could increase costs further.
The solution notebooks can run on any Spark-based platform with minimal edits to
remove the Databricks-specific packages. See the following similar solutions for various
Azure platforms:
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Perform data science with Azure Databricks
Deploy batch inference pipelines with Azure Machine Learning
Tutorial: Build an Azure Machine Learning pipeline for batch scoring
Related resources
Build a Real-time Recommendation API on Azure
Batch scoring for deep learning models using Azure Machine Learning pipelines
Batch scoring of Python Models on Azure
Build an enterprise-grade
conversational bot
Azure AI Bot Service Azure AI services
Architecture
Security and governance
Azure AD Key Vault
Identity and Keys and secrets
access control
2 3
Authenticate,
Token, keys
authorize Data ETL Raw data
Bot logic and UX Bot cognition and intelligence Azure Functions
1 Request Bot Service 4 Queries LUIS Web App Custom serverless
Input channels, Custom model for compute Structured
Intents and entities Data Data
authentication unstructured Q&A CRM, SQL, etc.
Data Factory
Bot Logic QnA Maker Scheduled ETL
User 7 Response 5 Results Azure Search
pipelines
Custom bot code Knowledge base Search index
(FAQs)
Unstructured
Logic Apps
Improved bot intelligence FAQs, PDFs,
Data connectors DOCs, etc.
6
End-to-end testing, Quality assurance and enhancements
Conversations,
new features
feedback, logs Enhanced data collection
Azure DevOps, VS Code
Workflow
The architecture shown here uses the following Azure services. Your own bot may not
use all of these services, or may incorporate additional services.
Data ingestion
The bot will rely on raw data that must be ingested and prepared. Consider any of the
following options to orchestrate this process:
Azure Data Factory. Data Factory orchestrates and automates data movement and
data transformation.
Logic Apps. Logic Apps is a serverless platform for building workflows that
integrate applications, data, and services. Logic Apps provides data connectors for
many applications, including Office 365.
Azure Functions. You can use Azure Functions to write custom serverless code that
is invoked by a trigger — for example, whenever a document is added to blob
storage or Azure Cosmos DB.
Application Insights. Use Application Insights to log the bot's application metrics
for monitoring, diagnostic, and analytical purposes.
Azure Blob Storage. Blob storage is optimized for storing massive amounts of
unstructured data.
Azure Cosmos DB. Azure Cosmos DB is well-suited for storing semi-structured log
data such as conversations.
Power BI. Use Power BI to create monitoring dashboards for your bot.
Security and governance
Microsoft Entra ID. Users will authenticate through an identity provider such as
Microsoft Entra ID. The Bot Service handles the authentication flow and OAuth
token management. See Add authentication to your bot via Azure Bot Service.
Azure Key Vault. Store credentials and other secrets using Key Vault.
Components
Bot Framework Service
Azure App Service
Azure Cognitive Services
Azure Search
Azure Data Factory
Azure Logic Apps
Azure Functions
Application Insights is a feature of Azure Monitor
Azure Blob Storage
Azure Cosmos DB
Microsoft Entra ID
Azure Key Vault
Scenario details
Each bot is different, but there are some common patterns, workflows, and technologies
to be aware of. Especially for a bot to serve enterprise workloads, there are many design
considerations beyond just the core functionality.
The best practice utility samples used in this architecture are fully open-sourced and
available on GitHub .
Recommendations
At a high level, a conversational bot can be divided into the bot functionality (the
"brain") and a set of surrounding requirements (the "body"). The brain includes the
domain-aware components, including the bot logic and ML capabilities. Other
components are domain agnostic and address non-functional requirements such as
CI/CD, quality assurance, and security.
Before getting into the specifics of this architecture, let's start with the data flow
through each subcomponent of the design. The data flow includes user-initiated and
system-initiated data flows.
User message. Once authenticated, the user sends a message to the bot. The bot reads
the message and routes it to a natural language understanding service such as LUIS.
This step gets the intents (what the user wants to do) and entities (what things the user
is interested in). The bot then builds a query that it passes to a service that serves
information, such as Azure Search for document retrieval, QnA Maker for FAQs, or a
custom knowledge base. The bot uses these results to construct a response. To give the
best result for a given query, the bot might make several back-and-forth calls to these
remote services.
Response. At this point, the bot has determined the best response and sends it to the
user. If the confidence score of the best-matched answer is low, the response might be a
disambiguation question or an acknowledgment that the bot could not reply
adequately.
Logging. When a user request is received or a response is sent, all conversation actions
should be logged to a logging store, along with performance metrics and general errors
from external services. These logs will be useful later when diagnosing issues and
improving the system.
Feedback. Another good practice is to collect user feedback and satisfaction scores. As a
follow up to the bot's final response, the bot should ask the user to rate their
satisfaction with the reply. Feedback can help you to solve the cold start problem of
natural language understanding, and continually improve the accuracy of responses.
Data in the intermediary store is then indexed into Azure Search for document retrieval,
loaded into QnA Maker to create question and answer pairs, or loaded into a custom
web app for unstructured text processing. The data is also used to train a LUIS model for
intent and entity extraction.
Quality assurance. The conversation logs are used to diagnose and fix bugs, provide
insight into how the bot is being used, and track overall performance. Feedback data is
useful for retraining the AI models to improve bot performance.
Building a bot
Before you even write a single line of code, it's important to write a functional
specification so the development team has a clear idea of what the bot is expected to
do. The specification should include a reasonably comprehensive list of user inputs and
expected bot responses in various knowledge domains. This living document will be an
invaluable guide for developing and testing your bot.
Ingest data
Next, identify the data sources that will enable the bot to interact intelligently with users.
As mentioned earlier, these data sources could contain structured, semi-structured, or
unstructured data sets. When you're getting started, a good approach is to make a one-
off copy of the data to a central store, such as Azure Cosmos DB or Azure Storage. As
you progress, you should create an automated data ingestion pipeline to keep this data
current. Options for an automated ingestion pipeline include Data Factory, Functions,
and Logic Apps. Depending on the data stores and the schemas, you might use a
combination of these approaches.
As you get started, it's reasonable to use the Azure portal to manually create Azure
resources. Later on, you should put more thought into automating the deployment of
these resources.
Once you have a specification and some data, it's time to start making your bot into
reality. Let's focus on the core bot logic. This is the code that handles the conversation
with the user, including the routing logic, disambiguation logic, and logging. Start by
familiarizing yourself with the Bot Framework , including:
You can use cards to include buttons, images, carousels, and menus.
A bot can support speech.
You can even embed your bot in an app or website and use the capabilities of the
app hosting it.
To get started, you can build your bot online using the Azure Bot Service, selecting from
the available C# and Node.js templates. As your bot gets more sophisticated, however,
you will need to create your bot locally then deploy it to the web. Choose an IDE, such
as Visual Studio or Visual Studio Code, and a programming language. SDKs are available
for the following languages:
C#
JavaScript
Java (preview)
Python (preview)
As a starting point, you can download the source code for the bot you created using the
Azure Bot Service. You can also find sample code , from simple echo bots to more
sophisticated bots that integrate with various AI services.
LUIS is specifically designed to understand user intents and entities. You train it
with a moderately sized collection of relevant user input and desired responses,
and it returns the intents and entities for a user's given message.
Azure Search can work alongside LUIS. Using Search, you create searchable indexes
over all relevant data. The bot queries these indexes for the entities extracted by
LUIS. Azure Search also supports synonyms, which can widen the net of correct
word mappings.
QnA Maker is another service that is designed to return answers for given
questions. It's typically trained over semi-structured data such as FAQs.
Your bot can use other AI services to further enrich the user experience. The Cognitive
Services suite of pre-built AI services (which includes LUIS and QnA Maker) has
services for vision, speech, language, search, and location. You can quickly add
functionality such as language translation, spell checking, sentiment analysis, OCR,
location awareness, and content moderation. These services can be wired up as
middleware modules in your bot to interact more naturally and intelligently with the
user.
Another option is to integrate your own custom AI service. This approach is more
complex, but gives you complete flexibility in terms of the machine learning algorithm,
training, and model. For example, you could implement your own topic modeling and
use algorithm such as LDA to find similar or relevant documents. A good approach is
to expose your custom AI solution as a web service endpoint, and call the endpoint from
the core bot logic. The web service could be hosted in App Service or in a cluster of
VMs. Azure Machine Learning provides a number of services and libraries to assist you
in training and deploying your models.
Feedback. It's also important to understand how satisfied users are with their bot
interactions. If you have a record of user feedback, you can use this data to focus your
efforts on improving certain interactions and retraining the AI models for improved
performance. Use the feedback to retrain the models, such as LUIS, in your system.
Testing. Testing a bot involves unit tests, integration tests, regression tests, and
functional tests. For testing, we recommend recording real HTTP responses from
external services, such as Azure Search or QnA Maker, so they can be played back
during unit testing without needing to make real network calls to external services.
7 Note
To jump-start your development in these areas, look at the Botbuilder Utils for
JavaScript . This repo contains sample utility code for bots built with Microsoft
Bot Framework v4 and running Node.js. It includes the following packages:
Azure Cosmos DB Logging Store . Shows how to store and query bot logs
in Azure Cosmos DB.
Application Insights Logging Store . Shows how to store and query bot logs
in Application Insights.
Feedback Collection Middleware . Sample middleware that provides a bot
user feedback-request mechanism.
Http Test Recorder . Records HTTP traffic from services external to the bot. It
comes pre-built with support for LUIS, Azure Search, and QnAMaker, but
extensions are available to support any service. This helps you automate bot
testing.
These packages are provided as utility sample code, and come with no guarantee
of support or updates.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Availability
As you roll out new features or bug fixes to your bot, it's best to use multiple
deployment environments, such as staging and production. Using deployment slots
from Azure DevOps allows you to do this with zero downtime. You can test your latest
upgrades in the staging environment before swapping them to the production
environment. In terms of handling load, App Service is designed to scale up or out
manually or automatically. Because your bot is hosted in Microsoft's global datacenter
infrastructure, the App Service SLA promises high availability.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
As with any other application, the bot can be designed to handle sensitive data.
Therefore, restrict who can sign in and use the bot. Also limit which data can be
accessed, based on the user's identity or role. Use Microsoft Entra ID for identity and
access control and Key Vault to manage keys and secrets.
DevOps
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
Use the Azure pricing calculator to estimate costs. Here are some other
considerations.
Bot application
In this architecture, the main cost driver is the Azure App Service in which the bot
application logic is hosted. Choose an App Service plan tier that best suits your needs.
Here are some recommendations:
Use Free and Shared (preview) tiers for testing purposes because the shared
resources cannot scale out.
Run your production workload on Basic, Standard, and Premium tiers because the
app runs on dedicated virtual machine instances and has allocated resources that
can scale out. App Service plans are billed on a per second basis.
You are charged for the instances in the App Service plan, even when the app is
stopped. Delete plans that you don't intend to use long term, such as test deployments.
For more information, see How much does my App Service plan cost?.
Data ingestion
Azure Data Factory
In this architecture, Data Factory automates the data ingestion pipeline. Explore a
range of data integration capabilities to fit your budget needs, from managed SQL
Server Integration Services for seamless migration of SQL Server projects to the
cloud (cost effective option), to large-scale, serverless data pipelines for integrating
data of all shapes and sizes.
Azure Functions
Logic Apps
Logic apps pricing works on the pay-as-you-go model. Logic apps have a pay-as-you-
go pricing model. Triggers, actions, and connector executions are metered each time a
logic app runs. All successful and unsuccessful actions, including triggers, are considered
as executions.
For instance, your logic app processes 1000 messages a day from Azure Service Bus. A
workflow of five actions will cost less than $6. For more information, see Logic Apps
pricing .
For other cost considerations, see the Cost section in Microsoft Azure Well-Architected
Framework.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Next steps
Review the Virtual Assistant template to quickly get started building conversational
bots.
Product documentation:
This reference architecture shows how to train a recommendation model by using Azure
Databricks, and then deploy the model as an API by using Azure Cosmos DB, Azure
Machine Learning, and Azure Kubernetes Service (AKS). For a reference implementation
of this architecture see Building a Real-time Recommendation API on GitHub.
Architecture
Dataflow
1. Track user behaviors. For example, a back-end service might log when a user rates
a movie or clicks a product or news article.
2. Load the data into Azure Databricks from an available data source.
3. Prepare the data and split it into training and testing sets to train the model. (This
guide describes options for splitting data.)
4. Fit the Spark Collaborative Filtering model to the data.
5. Evaluate the quality of the model using rating and ranking metrics. (This guide
provides details about the metrics that you can use to evaluate your
recommender.)
6. Precompute the top 10 recommendations per user and store as a cache in Azure
Cosmos DB.
7. Deploy an API service to AKS using the Machine Learning APIs to containerize and
deploy the API.
8. When the back-end service gets a request from a user, call the recommendations
API hosted in AKS to get the top 10 recommendations and display them to the
user.
Components
Azure Databricks . Databricks is a development environment used to prepare
input data and train the recommender model on a Spark cluster. Azure Databricks
also provides an interactive workspace to run and collaborate on notebooks for
any data processing or machine learning tasks.
Azure Kubernetes Service (AKS). AKS is used to deploy and operationalize a
machine learning model service API on a Kubernetes cluster. AKS hosts the
containerized model, providing scalability that meets your throughput
requirements, identity and access management, and logging and health
monitoring.
Azure Cosmos DB . Azure Cosmos DB is a globally distributed database service
used to store the top 10 recommended movies for each user. Azure Cosmos DB is
well-suited for this scenario, because it provides low latency (10 ms at 99th
percentile) to read the top recommended items for a given user.
Machine Learning . This service is used to track and manage machine learning
models, and then package and deploy these models to a scalable AKS
environment.
Microsoft Recommenders . This open-source repository contains utility code and
samples to help users get started in building, evaluating, and operationalizing a
recommender system.
Scenario details
This architecture can be generalized for most recommendation engine scenarios,
including recommendations for products, movies, and news.
This solution is optimized for the retail industry and for the media and entertainment
industries.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands
placed on it by users in an efficient manner. For more information, see Performance
efficiency pillar overview.
The combination of AKS and Azure Cosmos DB enables this architecture to provide a
good starting point to provide recommendations for a medium-sized workload with
minimal overhead. Under a load test with 200 concurrent users, this architecture
provides recommendations at a median latency of about 60 ms and performs at a
throughput of 180 requests per second. The load test was run against the default
deployment configuration (a 3x D3 v2 AKS cluster with 12 vCPUs, 42 GB of memory, and
11,000 Request Units (RUs) per second provisioned for Azure Cosmos DB).
Azure Cosmos DB is recommended for its turnkey global distribution and usefulness in
meeting any database requirements your app has. To reduce latency slightly, consider
using Azure Cache for Redis instead of Azure Cosmos DB to serve lookups. Azure Cache
for Redis can improve performance of systems that rely heavily on data in back-end
stores.
Scalability
If you don't plan to use Spark, or you have a smaller workload that doesn't need
distribution, consider using a Data Science Virtual Machine (DSVM) instead of Azure
Databricks. A DSVM is an Azure virtual machine with deep learning frameworks and
tools for machine learning and data science. As with Azure Databricks, any model you
create in a DSVM can be operationalized as a service on AKS via Machine Learning.
During training, either provision a larger fixed-size Spark cluster in Azure Databricks, or
configure autoscaling. When autoscaling is enabled, Databricks monitors the load on
your cluster and scales up and down as needed. Provision or scale out a larger cluster if
you have a large data size and you want to reduce the amount of time it takes for data
preparation or modeling tasks.
Scale the AKS cluster to meet your performance and throughput requirements. Take care
to scale up the number of pods to fully utilize the cluster, and to scale the nodes of the
cluster to meet the demand of your service. You can also set autoscaling on an AKS
cluster. For more information, see Deploy a model to an Azure Kubernetes Service
cluster.
To manage Azure Cosmos DB performance, estimate the number of reads required per
second, and provision the number of RUs per second (throughput) needed. Use best
practices for partitioning and horizontal scaling.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
Manage the Azure Databricks costs by retraining less frequently and turning off the
Spark cluster when not in use. The AKS and Azure Cosmos DB costs are tied to the
throughput and performance required by your site and will scale up and down
depending on the volume of traffic to your site.
alize/als_movie_o16n.ipynb
d. Click Import.
8. Open the notebook within Azure Databricks and attach the configured cluster.
9. Run the notebook to create the Azure resources required to create a
recommendation API that provides the top-10 movie recommendations for a given
user.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Next steps
Building a Real-time Recommendation API
What is Azure Databricks?
Azure Kubernetes Service
Welcome to Azure Cosmos DB
What is Azure Machine Learning?
Related resources
Batch scoring of Spark models on Azure Databricks
Build a content-based recommendation system
Personalization using Cosmos DB
Retail assistant with visual capabilities
Create personalized marketing solutions in near real time
Personalized offers
Build and deploy a social media
analytics solution
Azure AI services Azure Synapse Analytics Azure Machine Learning Azure Data Lake Power BI Embedded
To best address customer needs, organizations need to extract insights from social
media about their customers. This article presents a solution for analyzing news and
social media data. The solution extends the Azure Social Media Analytics Solution
Accelerator , which gives developers the resources needed to build and deploy a social
media monitoring platform on Azure in a few hours. That platform collects social media
and website data and presents the data in a format that supports the business decision–
making process.
Apache®, Apache Spark , and the flame logo are either registered trademarks or
trademarks of the Apache Software Foundation in the United States and/or other
countries. No endorsement by The Apache Software Foundation is implied by the use of
these marks.
Architecture
Dataflow
1. Azure Synapse Analytics pipelines ingest external data and store that data in Azure
Data Lake. One pipeline ingests data from news APIs. The other pipeline ingests
data from the Twitter API.
2. Apache Spark pools in Azure Synapse Analytics are used to process and enrich the
data.
Azure Cognitive Service for Language, for named entity recognition (NER),
key phrase extraction, and sentiment analysis
Azure Cognitive Services Translator, to translate text
Azure Maps, to link data to geographical coordinates
5. A serverless SQL pool in Azure Synapse Analytics makes the enriched data
available to Power BI.
10. A managed online endpoint is used for online, real-time inferencing, for instance,
on a mobile app (A). Alternatively, a batch endpoint is used for offline model
inferencing (B).
Components
Azure Synapse Analytics is an integrated analytics service that accelerates time
to insight across data warehouses and big data systems.
Translator helps you to translate text instantly or in batches across more than
100 languages. This service uses the latest innovations in machine translation.
Translator supports a wide range of use cases, such as translation for call centers,
multilingual conversational agents, and in-app communication. For the languages
that Translator supports, see Translation.
Azure Maps is a suite of geospatial services that help you incorporate location-
based data into web and mobile solutions. You can use the location and map data
to generate insights, inform data-driven decisions, enhance security, and improve
customer experiences. This solution uses Azure Maps to link news and posts to
geographical coordinates.
App Service provides a framework for building, deploying, and scaling web apps.
The Web Apps feature is a service for hosting web applications, REST APIs, and
mobile back ends.
Power BI is a collection of analytics services and apps. You can use Power BI to
connect and display unrelated sources of data.
Alternatives
You can simplify this solution by eliminating Machine Learning and the custom machine
learning models, as the following diagram shows. For more information, see Deploy this
scenario, later in this article.
Scenario details
Marketing campaigns are about more than the message that you deliver. When and
how you deliver that message is just as important. Without a data-driven, analytical
approach, campaigns can easily miss opportunities or struggle to gain traction. Those
campaigns are often based on social media analysis, which has become increasingly
important for companies and organizations around the world. Social media analysis is a
powerful tool that you can use to receive instant feedback on products and services,
improve interactions with customers to increase customer satisfaction, keep up with the
competition, and more. Companies often lack efficient, viable ways to monitor social
media conversations. As a result, they miss opportunities to use these insights to inform
their strategies and plans.
This article's solution benefits a wide spectrum of social media and news analysis
applications. By deploying the solution instead of manually deploying its resources, you
can reduce your time to market. You can also:
For instance, to see the latest discussions about Satya Nadella, you enter his name in a
query. The solution then accesses news APIs and the Twitter API to provide information
about him from around the web.
Potential use cases
By extracting information about your customers from social media, you can enhance
customer experiences, increase customer satisfaction, gain new leads, and prevent
customer churn. These applications of social media analytics fall into three main areas:
Marketing is an integral part of every organization. As a result, you can use this social
media analytics solution for these use cases in various industries:
Retail
Finance
Manufacturing
Healthcare
Government
Energy
Telecommunications
Automotive
Nonprofit
Gaming
Media and entertainment
Travel, including hospitality and restaurants
Facilities, including real estate
Sports
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Reliability
Reliability ensures your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.
Use Azure Monitor and Application Insights to monitor the health of Azure
resources.
Review the following resiliency considerations before you implement this solution:
Azure Synapse Analytics
App Service
For more information about resiliency in Azure, see Design reliable Azure
applications.
For availability guarantees of various Azure components, see the following service
level agreements (SLAs):
SLA for Azure Synapse Analytics
SLA for Storage Accounts
SLA for Azure Maps
SLA for Azure Cognitive Services
SLA for Azure Machine Learning
SLA for App Service
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
To estimate the cost of this solution, use the Azure pricing calculator .
Operational excellence
Operational excellence covers the operations processes that deploy an application and
keep it running in production. For more information, see Overview of the operational
excellence pillar.
Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands
placed on it by users in an efficient manner. For more information, see Performance
efficiency pillar overview.
For information about Spark pool scaling and node sizes, see Apache Spark pool
configurations in Azure Synapse Analytics.
You can scale Machine Learning training pipelines up and down based on data size
and other configuration parameters.
Serverless SQL pools are available on demand. They don't require scaling up,
down, in, or out.
Azure Synapse Analytics supports Apache Spark 3.1.2, which delivers significant
performance improvements over its predecessors .
Prerequisites
To use the solution accelerator, you need access to an Azure subscription .
A basic understanding of Azure Synapse Analytics, Azure Cognitive Services, Azure
Maps, and Power BI is helpful but not required.
A news API account is required.
A Twitter developer account with Elevated access to Twitter API features is
required.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributor.
Principal author:
Next steps
What is Azure Synapse Analytics?
Azure Machine Learning documentation
What are Azure Cognitive Services?
Azure Cognitive Service for Language documentation
What is Azure Cognitive Services Translator?
What is Azure Maps?
What is Power BI?
Tutorial: Sentiment analysis with Cognitive Services in Azure Synapse Analytics
Tutorial: Text Analytics with Cognitive Service in Azure Synapse Analytics
Related resources
Artificial intelligence (AI) architecture design
Choose a Microsoft cognitive services technology
Optimize marketing with machine learning
Spaceborne data analysis with Azure Synapse Analytics
Implement logging and
monitoring for Azure OpenAI
models
Azure AI services Azure API Management Azure Monitor Azure Active Directory
This solution provides comprehensive logging and monitoring and enhanced security
for enterprise deployments of the Azure OpenAI Service API. The solution enables
advanced logging capabilities for tracking API usage and performance and robust
security measures to help protect sensitive data and help prevent malicious activity.
Architecture
Workflow
1. Client applications access Azure OpenAI endpoints to perform text generation
(completions) and model training (fine-tuning).
7 Note
Load balancing of stateful operations like model fine-tuning, deployments,
and inference of fine-tuned models isn't supported.
3. Azure API Management enables security controls and auditing and monitoring of
the Azure OpenAI models.
a. In API Management, enhanced-security access is granted via Microsoft Entra
groups with subscription-based access permissions.
b. Auditing is enabled for all interactions with the models via Azure Monitor
request logging.
c. Monitoring provides detailed Azure OpenAI model usage KPIs and metrics,
including prompt information and token statistics for usage traceability.
4. API Management connects to all Azure resources via Azure Private Link. This
configuration provides enhanced security for all traffic via private endpoints and
contains traffic in the private network.
5. Multiple Azure OpenAI instances enable scale-out of API usage to ensure high
availability and disaster recovery for the service.
Components
Application Gateway . Application load balancer to help ensure that all users of
the Azure OpenAI APIs get the fastest response and highest throughput for model
completions.
API Management . API management platform for accessing back-end Azure
OpenAI endpoints. Provides monitoring and logging that's not available natively in
Azure OpenAI.
Azure Virtual Network . Private network infrastructure in the cloud. Provides
network isolation so that all network traffic for models is routed privately to Azure
OpenAI.
Azure OpenAI . Service that hosts models and provides generative model
completion outputs.
Monitor . End-to-end observability for applications. Provides access to
application logs via Kusto Query Language. Also enables dashboard reports and
monitoring and alerting capabilities.
Azure Key Vault . Enhanced-security storage for keys and secrets that are used by
applications.
Azure Storage . Application storage in the cloud. Provides Azure OpenAI with
accessibility to model training artifacts.
Microsoft Entra ID . Enhanced-security identity manager. Enables user
authentication and authorization to the application and to platform services that
support the application. Also provides Group Policy to ensure that the principle of
least privilege is applied to all users.
Alternatives
Azure OpenAI provides native logging and monitoring. You can use this native
functionality to track telemetry of the service, but the default cognitive service logging
doesn't track or record inputs and outputs of the service, like prompts, tokens, and
models. These metrics are especially important for compliance and to ensure that the
service operates as expected. Also, by tracking interactions with the large language
models deployed to Azure OpenAI, you can analyze how your organization is using the
service to identify cost and usage patterns that can help inform decisions on scaling and
resource allocation.
The following table provides a comparison of the metrics provided by the default Azure
OpenAI logging and those provided by this solution.
Request count x x
Latency x x
Model utilization x
Token utilization x x
(input/output)
characters)
Deployment operations x x
Scenario details
Large enterprises that use generative AI models need to implement auditing and
logging of the use of these models to ensure responsible use and corporate compliance.
This solution provides enterprise-level logging and monitoring for all interactions with
AI models to mitigate harmful use of the models and help ensure that security and
compliance standards are met. The solution integrates with existing APIs for Azure
OpenAI with little modification to take advantage of existing code bases. Administrators
can also monitor service usage for reporting.
ApiManagementGatewayLogs
| where OperationId == 'completions_create'
| extend modelkey = substring(parse_json(BackendResponseBody)['model'], 0,
indexof(parse_json(BackendResponseBody)['model'], '-', 0, -1, 2))
| extend model = tostring(parse_json(BackendResponseBody)['model'])
| extend prompttokens = parse_json(parse_json(BackendResponseBody)['usage'])
['prompt_tokens']
| extend completiontokens = parse_json(parse_json(BackendResponseBody)
['usage'])['completion_tokens']
| extend totaltokens = parse_json(parse_json(BackendResponseBody)['usage'])
['total_tokens']
| extend ip = CallerIpAddress
| summarize
sum(todecimal(prompttokens)),
sum(todecimal(completiontokens)),
sum(todecimal(totaltokens)),
avg(todecimal(totaltokens))
by ip, model
Output:
ApiManagementGatewayLogs
| where OperationId == 'completions_create'
| extend model = tostring(parse_json(BackendResponseBody)['model'])
| extend prompttokens = parse_json(parse_json(BackendResponseBody)['usage'])
['prompt_tokens']
| extend prompttext = substring(parse_json(parse_json(BackendResponseBody)
['choices'])[0], 0, 100)
Output:
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that you can use to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Reliability
Reliability ensures that your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.
This scenario ensures high availability of the large language models for your enterprise
users. The Azure application gateway provides an effective layer-7 application delivery
mechanism to ensure fast and consistent access to applications. You can use API
Management to configure, manage, and monitor access to your models. The inherent
high availability of platform services like Storage, Key Vault, and Virtual Network ensure
high reliability for your application. Finally, multiple instances of Azure OpenAI ensure
service resilience in case of application-level failures. These architecture components can
help you ensure the reliability of your application at enterprise scale.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
Cost optimization
Cost optimization is about reducing unnecessary expenses and improving operational
efficiencies. For more information, see Overview of the cost optimization pillar.
To help you explore the cost of running this scenario, we've preconfigured all the
services in the Azure pricing calculator. To learn how the pricing would change for your
use case, change the appropriate variables to match your expected traffic.
The following three sample cost profiles provide estimates based on the amount of
traffic. (The estimates assume that a document contains approximately 1,000 tokens.)
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Other contributors:
Next steps
Azure OpenAI request form
Best practices for prompt engineering with OpenAI API
Azure OpenAI: Documentation, quickstarts, API reference
Azure-Samples/openai-python-enterprise-logging (GitHub)
Configure Azure Cognitive Services virtual networks
Related resources
Protect APIs with Azure Application Gateway and Azure API Management
Query-based document summarization
AI architecture design
Secure research environment for
regulated data
Azure Data Science Virtual Machines Azure Machine Learning Azure Data Factory
Architecture
Approver
Network
6 security
group
Logic app
4 3
Researcher
7 Approved Data science virtual Azure Virtual
data Desktop
machine
2 Managed
Data Factory identity
Blob Storage 5 Virtual Virtual
(private) network network
gateway gateway
8 Azure Machine
Learning compute
User-defined routing
Virtual network
1 table
Firewall
Data owner Blob Storage policy
(public) Key Vault Azure Firewall
Virtual network
Resource group Resource group
Dataflow
1. Data owners upload datasets into a public blob storage account. The data is
encrypted by using Microsoft-managed keys.
2. Azure Data Factory uses a trigger that starts copying of the uploaded dataset to a
specific location (import path) on another storage account with security controls.
The storage account can only be reached through a private endpoint. Also, it's
accessed by a service principal with limited permissions. Data Factory deletes the
original copy making the dataset immutable.
4. The dataset in the secure storage account is presented to the data science VMs
provisioned in a secure network environment for research work. Much of the data
preparation is done on those VMs.
5. The secure environment has Azure Machine Learning compute that can access the
dataset through a private endpoint for users for AML capabilities, such as to train,
deploy, automate, and manage machine learning models. At this point, models are
created that meet regulatory guidelines. All model data is de-identified by
removing personal information.
The app starts an approval process requesting a review of data that is queued to
be exported. The manual reviewers ensure that sensitive data isn't exported. After
the review process, the data is either approved or denied.
7 Note
If an approval step is not required on exfiltration, the Logic App step could be
omitted.
7. If the de-identified data is approved, it's sent to the Data Factory instance.
8. Data Factory moves the data to the public storage account in a separate container
to allow external researchers to have access to their exported data and models.
Alternately, you can provision another storage account in a lower security
environment.
Components
This architecture consists of several Azure cloud services that scale resources according
to need. The services and their roles are described below. For links to product
documentation to get started with these services, see Next steps.
Azure Data Science Virtual Machine (DSVM): VMs that are configured with
tools used for data analytics and machine learning.
Azure Machine Learning: Used to train, deploy, automate, and manage machine
learning models and to manage the allocation and use of ML compute resources.
Azure Machine Learning Compute: A cluster of nodes that are used to train and
test machine learning and AI models. The compute is allocated on demand based
on an automatic scaling option.
Azure Blob storage: There are two instances. The public instance is used to
temporarily store the data uploaded by data owners. Also, it stores deidentified
data after modeling in a separate container. The second instance is private. It
receives the training and test data sets from Machine Learning that are used by the
training scripts. Storage is mounted as a virtual drive onto each node of a Machine
Learning Compute cluster.
Azure Virtual Desktop is used as a jump box to gain access to the resources in
the secure environment with streaming applications and a full desktop, as needed.
Alternately, you can use Azure Bastion . But, have a clear understanding of the
security control differences between the two options. Virtual Desktop has some
advantages:
Ability to stream an app like VSCode to run notebooks against the machine
learning compute resources.
Ability to limit copy, paste, and screen captures.
Support for Microsoft Entra authentication to DSVM.
Azure Logic Apps provides automated low-code workflow to develop both the
trigger and release portions of the manual approval process.
Microsoft Defender for Cloud is used to evaluate the overall security posture of
the implementation and provide an attestation mechanism for regulatory
compliance. Issues that were previously found during audits or assessments can be
discovered early. Use features to track progress such as secure score and
compliance score.
Governance components
Azure Policy helps to enforce organizational standards and to assess compliance
at-scale.
Alternatives
This solution uses Data Factory to move the data to the public storage account in a
separate container, in order to allow external researchers to have access to their
exported data and models. Alternately, you can provision another storage account
in a lower security environment.
This solution uses Azure Virtual Desktop as a jump box to gain access to the
resources in the secure environment, with streaming applications and a full
desktop. Alternately, you can use Azure Bastion. But, Virtual Desktop has some
advantages, which include the ability to stream an app, to limit copy/paste and
screen captures, and to support AAC authentication. You can also consider
configuring Point to Site VPN for offline training locally. This will also help save
costs of having multiple VMs for workstations.
To secure data at rest, this solution encrypts all Azure Storage with Microsoft-
managed keys using strong cryptography. Alternately, you can use customer-
managed keys. The keys must be stored in a managed key store.
Scenario details
By following the guidance you can maintain full control of your research data, have
separation of duties, and meet strict regulatory compliance standards while providing
collaboration between the typical roles involved in a research-oriented workload; data
owners, researchers, and approvers.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
The main objective of this architecture is to provide a secure and trusted research
environment that strictly limits the exfiltration of data from the secure area.
Network security
Azure resources that are used to store, test, and train research data sets are provisioned
in a secure environment. That environment is an Azure Virtual Network (VNet) that has
network security groups (NSGs) rules to restrict access, mainly:
Inbound and outbound access to the public internet and within the VNet.
Access to and from specific services and ports. For example, this architecture
blocks all ports ranges except the ones required for Azure Services (such as Azure
Monitor). A full list of Service Tags and the corresponding services can be found
here.
Also, access from VNet with Azure Virtual Desktop (AVD) on ports limited to
approved access methods is accepted, all other traffic is denied. When compared
to this environment, the other VNet (with AVD) is relatively open.
The main blob storage in the secure environment is off the public internet. It's only
accessible within the VNet through private endpoint connections and Azure Storage
Firewalls. It's used to limit the networks from which clients can connect to Azure file
shares.
This architecture uses credential-based authentication for the main data store that is in
the secure environment. In this case, the connection information like the subscription ID
and token authorization is stored in a key vault. Another option is to create identity-
based data access, where your Azure account is used to confirm if you have access to
the Storage service. In an identity-based data access scenario, no authentication
credentials are saved. For the details on how to use identity-based data access, see
Connect to storage by using identity-based data access.
The compute cluster can solely communicate within the virtual network, by using the
Azure Private Link ecosystem and service/private endpoints, rather than using Public IP
for communication. Make sure you enable No public IP. For details about this feature,
which is currently in preview (as of 3/7/2022), see No public IP for compute instances.
The secure environment uses Azure Machine Learning compute to access the dataset
through a private endpoint. Additionally, Azure Firewall can be used to control
outbound access from Azure Machine Learning compute. To learn about how to
configure Azure Firewall to control access to Azure Machine Learning compute, which
resides in a machine learning workspace, see Configure inbound and outbound network
traffic.
To learn one of the ways to secure an Azure Machine Learning environment, see the
blog post, Secure Azure Machine Learning Service (AMLS) Environment .
For Azure services that cannot be configured effectively with private endpoints, or to
provide stateful packet inspection, consider using Azure Firewall or a third-party
network virtual appliance (NVA).
Identity management
The Blob storage access is through Azure Role-based access controls (RBAC).
Azure Virtual Desktop supports Microsoft Entra authentication to DSVM.
Data Factory uses managed identity to access data from the blob storage. DSVMs also
uses managed identity for remediation tasks.
Data security
To secure data at rest, all Azure Storage is encrypted with Microsoft-managed keys using
strong cryptography.
Alternately, you can use customer-managed keys. The keys must be stored in a
managed key store. In this architecture, Azure Key Vault is deployed in the secure
environment to store secrets such as encryption keys and certificates. Key Vault is
accessed through a private endpoint by the resources in the secure VNet.
Governance considerations
Enable Azure Policy to enforce standards and provide automated remediation to bring
resources into compliance for specific policies. The policies can be applied to a project
subscription or at a management group level as a single policy or as part of a regulatory
Initiative.
For example, in this architecture Azure Policy Guest Configuration was applied to all VMs
in scope. The policy can audit operating systems and machine configuration for the Data
Science VMs.
VM image
The Data Science VMs run customized base images. To build the base image, we highly
recommend technologies like Azure Image Builder. This way you can create a repeatable
image that can be deployed when needed.
The base image might need updates, such as additional binaries. Those binaries should
be uploaded to the public blob storage and flow through the secure environment, much
like the datasets are uploaded by data owners.
Other considerations
Most research solutions are temporary workloads and don't need to be available for
extended periods. This architecture is designed as a single-region deployment with
availability zones. If the business requirements demand higher availability, replicate this
architecture in multiple regions. You would need other components, such as global load
balancer and distributor to route traffic to all those regions. As part of your recovery
strategy, capturing and creating a copy of the customized base image with Azure Image
Builder is highly recommended.
The size and type of the Data Science VMs should be appropriate to the style of work
being performed. This architecture is intended to support a single research project and
the scalability is achieved by adjusting the size and type of the VMs and the choices
made for compute resources available to AML.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
The cost of DSVMs depends on the choice of the underlying VM series. Because the
workload is temporary, the consumption plan is recommended for the Logic App
resource. Use the Azure pricing calculator to estimate costs based on estimated sizing
of resources needed.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Microsoft Data Science Virtual Machine (DSVM)
What is Azure Machine Learning?
Azure Machine Learning Compute
Introduction to Azure Blob storage
Introduction to Azure Data Factory
Azure Virtual Desktop
Microsoft Defender for Cloud
Microsoft Sentinel
Azure Monitor
Azure Policy
Azure Policy Guest Configuration
Related resources
Compare the machine learning products and technologies from Microsoft
Azure Machine Learning architecture
Scale AI and machine learning initiatives in regulated industries
Many models machine learning (ML) at scale with Azure Machine Learning
Compare Microsoft machine learning
products and technologies
Article • 08/31/2023
Learn about the machine learning products and technologies from Microsoft. Compare
options to help you choose how to most effectively build, deploy, and manage your
machine learning solutions.
Azure Machine Learning Managed platform for Use a pretrained model. Or, train,
machine learning deploy, and manage models on Azure
using Python and CLI
Azure SQL Managed Instance In-database machine Train and deploy models inside Azure
Machine Learning Services learning for SQL SQL Managed Instance
Machine learning in Azure Analytics service with Train and deploy models inside Azure
Synapse Analytics machine learning Synapse Analytics
Machine learning and AI with Machine learning in Train and deploy models inside Azure
ONNX in Azure SQL Edge SQL on IoT SQL Edge
Azure Databricks Apache Spark-based Build and deploy models and data
analytics platform workflows using integrations with
open-source machine learning
libraries and the MLFlow platform.
SQL Server Machine Learning Services In-database machine Train and deploy models inside
learning for SQL SQL Server
Machine Learning Services on SQL Machine learning in Train and deploy models on
Server Big Data Clusters Big Data Clusters SQL Server Big Data Clusters
Azure Data Science Virtual machine with pre- Develop machine learning solutions
Virtual Machine installed data science tools in a pre-configured environment
Item Description
Key benefits Code first (SDK) and studio & drag-and-drop designer web interface
authoring options.
Vision: Object detection, face recognition, OCR, etc. See Computer Vision, Face,
Form Recognizer.
Speech: Speech-to-text, text-to-speech, speaker recognition, etc. See Speech
Service.
Language: Translation, Sentiment analysis, key phrase extraction, language
understanding, etc. See Translator, Text Analytics, Language Understanding, QnA
Maker
Decision: Anomaly detection, content moderation, reinforcement learning. See
Anomaly Detector, Content Moderator, Personalizer.
Use Cognitive Services to develop apps across devices and platforms. The APIs keep
improving, and are easy to set up.
Item Description
Supported Various options depending on the service. Standard ones are C#, Java,
languages JavaScript, and Python.
Key benefits Build intelligent applications using pre-trained models available through
REST API and SDK.
Variety of models for natural communication methods with vision, speech,
language, and decision.
No machine learning or data science expertise required.
Use SQL machine learning when you need built-in AI and predictive analytics on
relational data in SQL.
Item Description
Deployment
Considerations Assumes a SQL database as the data tier for your application.
Use the Data Science VM when you need to run or host your jobs on a single node. Or if
you need to remotely scale up your processing on a single machine.
Item Description
Key benefits Reduced time to install, manage, and troubleshoot data science tools and
frameworks.
The latest versions of all commonly used tools and frameworks are included.
Virtual machine options include highly scalable images with GPU capabilities for
intensive data modeling.
Running a virtual machine incurs Azure charges, so you must be careful to have
it running only when required.
Azure Databricks
Azure Databricks is an Apache Spark-based analytics platform optimized for the
Microsoft Azure cloud services platform. Databricks is integrated with Azure to provide
one-click setup, streamlined workflows, and an interactive workspace that enables
collaboration between data scientists, data engineers, and business analysts. Use
Python, R, Scala, and SQL code in web-based notebooks to query, visualize, and model
data.
Use Databricks when you want to collaborate on building machine learning solutions on
Apache Spark.
Item Description
ML.NET
ML.NET is an open-source, and cross-platform machine learning framework. With
ML.NET, you can build custom machine learning solutions and integrate them into your
.NET applications. ML.NET offers varying levels of interoperability with popular
frameworks like TensorFlow and ONNX for training and scoring machine learning and
deep learning models. For resource-intensive tasks like training image classification
models, you can take advantage of Azure to train your models in the cloud.
Use ML.NET when you want to integrate machine learning solutions into your .NET
applications. Choose between the API for a code-first experience and Model Builder or
the CLI for a low-code experience.
Item Description
Languages C#, F#
supported
Windows ML
Windows ML inference engine allows you to use trained machine learning models in
your applications, evaluating trained models locally on Windows 10 devices.
Use Windows ML when you want to use trained machine learning models within your
Windows applications.
Item Description
MMLSpark
Microsoft ML for Apache Spark (MMLSpark) is an open-source library that expands
the distributed computing framework Apache Spark . MMLSpark adds many deep
learning and data science tools to the Spark ecosystem, including seamless integration
of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK),
LightGBM , LIME (Model Interpretability) , and OpenCV . You can use these tools to
create powerful predictive models on any Spark cluster, such as Azure Databricks or
Cosmic Spark.
MMLSpark also brings new networking capabilities to the Spark ecosystem. With the
HTTP on Spark project, users can embed any web service into their SparkML models.
Additionally, MMLSpark provides easy-to-use tools for orchestrating Azure Cognitive
Services at scale. For production-grade deployment, the Spark Serving project enables
high throughput, submillisecond latency web services, backed by your Spark cluster.
Item Description
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
To learn about all the Artificial Intelligence (AI) development products available
from Microsoft, see Microsoft AI platform .
For training in developing AI and Machine Learning solutions with Microsoft, see
Microsoft Learn training.
Related resources
Choose a Microsoft cognitive services technology
Artificial intelligence (AI) architecture design
How Azure Machine Learning works: resources and assets
Query-based document summarization
Article • 08/31/2023
This guide shows how to perform document summarization by using the Azure OpenAI
GPT-3 model. It describes concepts that are related to the document summarization
process, approaches to the process, and recommendations on which model to use for
specific use cases. Finally, it presents two use cases, together with sample code snippets,
to help you understand key concepts.
Architecture
The following diagram shows how a user query fetches relevant data. The summarizer
uses GPT-3 to generate a summary of the text of the most relevant document. In this
architecture, the GPT-3 endpoint is used to summarize the text.
Workflow
This workflow occurs in near-real time.
Scenario details
Enterprises frequently create and maintain a knowledge base about business processes,
customers, products, and information. However, returning relevant content based on a
user query of a large dataset is often challenging. The user can query the knowledge
base and find an applicable document by using methods like page rank, but delving
further into the document to search for relevant information typically becomes a manual
task that takes time. However, with recent advances in foundation transformer models
like the one developed by OpenAI, the query mechanism has been refined by semantic
search methods that use encoding information like embeddings to find relevant
information. These developments enable the ability to summarize content and present it
to the user in a concise and succinct way.
Some benefits of using a summarization service for any use case are:
In-context learning
Azure OpenAI Service uses a generative completion model. The model uses natural
language instructions to identify the requested task and the skill required, a process
known as prompt engineering. When you use this approach, the first part of the prompt
includes natural language instructions and/or examples of the desired task. The model
completes the task by predicting the most probable next text. This technique is known
as in-context learning.
With in-context learning, language models can learn tasks from just a few examples. The
language model is provided with a prompt that contains a list of input-output pairs that
demonstrate a task, and then with a test input. The model makes a prediction by
conditioning on the prompt and predicting the next tokens.
There are three main approaches to in-context learning: zero-shot learning, few-shot
learning, and fine-tuning methods that change and improve the output. These
approaches vary based on the amount of task-specific data that's provided to the
model.
Zero-shot: In this approach, no examples are provided to the model. Only the task
request is provided as input. In zero-shot learning, the model depends on previously
trained concepts. It responds based only on data that it's trained on. It doesn't
necessarily understand the semantic meaning, but it has a statistic understanding that's
based on everything that it's learned from the internet about what should be generated
next. The model attempts to relate the given task to existing categories that it has
already learned about and responds accordingly.
Few-shot: In this approach, several examples that demonstrate the expected answer
format and content are included in the call prompt. The model is provided with a very
small training dataset to guide its predictions. Training with a small set of examples
enables the model to generalize and understand unrelated but previously unseen tasks.
Creating few-shot examples can be challenging because you need to accurately
articulate the task that you want the model to perform. One commonly observed
problem is that models are sensitive to the writing style that's used in the training
examples, especially small models.
When you create a GPT-3 solution, the main effort is in the design and content of the
training prompt.
Prompt engineering
Prompt engineering is a natural language processing discipline that involves discovering
inputs that yield desirable or useful outputs. When a user prompts the system, the way
the content is expressed can dramatically change the output. Prompt design is the most
significant process for ensuring that the GPT-3 model provides a desirable and
contextual response.
The architecture described in this article uses the completions endpoint for
summarization. The completions endpoint is an Azure Cognitive Services API that
accepts a partial prompt or context as input and returns one or more outputs that
continue or complete the input text. A user provides input text as a prompt, and the
model generates text that attempts to match the context or pattern that's provided.
Prompt design is highly dependent on the task and data. Incorporating prompt
engineering into a fine-tuning dataset and investigating what works best before using
the system in production requires significant time and effort.
Prompt design
GPT-3 models can perform multiple tasks, so you need to be explicit in the goals of the
design. The models estimate the desired output based on the provided prompt.
For example, if you input the words "Give me a list of cat breeds," the model doesn't
automatically assume that you're asking for a list of cat breeds. You could be asking the
model to continue a conversation in which the first words are "Give me a list of cat
breeds" and the next ones are "and I'll tell you which ones I like." If the model just
assumed that you wanted a list of cats, it wouldn't be as good at content creation,
classification, or other tasks.
As described in Learn how to generate or manipulate text, there are three basic
guidelines for creating prompts:
Show and tell. Improve the clarity about what you want by providing instructions,
examples, or a combination of the two. If you want the model to rank a list of
items in alphabetical order or to classify a paragraph by sentiment, show it that
that's what you want.
Provide quality data. If you're building a classifier or want a model to follow a
pattern, be sure to provide enough examples. You should also proofread your
examples. The model can usually recognize spelling mistakes and return a
response, but it might assume misspellings are intentional, which can affect the
response.
Check your settings. The temperature and top_p settings control how
deterministic the model is in generating a response. If you ask it for a response
that has only one right answer, configure these settings at a lower level. If you
want more diverse responses, you might want to configure the settings at a higher
level. A common error is to assume that these settings are "cleverness" or
"creativity" controls.
Alternatives
Azure conversational language understanding is an alternative to the summarizer used
here. The main purpose of conversational language understanding is to build models
that predict the overall intention of an incoming utterance, extract valuable information
from it, and produce a response that aligns with the topic. It's useful in chatbot
applications when it can refer to an existing knowledge base to find the suggestion that
best corresponds to the incoming utterance. It doesn't help much when the input text
doesn't require a response. The intent in this architecture is to generate a short
summary of long textual content. The essence of the content is described in a concise
manner and all important information is represented.
Example scenarios
Zero-shot prompt engineering is used to summarize the bills. The prompt and settings
are then modified to generate different summary outputs.
Dataset
The first dataset is the BillSum dataset for summarization of US Congressional and
California state bills. This example uses only the Congressional bills. The data is split into
18,949 bills to use for training and 3,269 bills to use for testing. BillSum focuses on mid-
length legislation that's between 5,000 and 20,000 characters long. It's cleaned and
preprocessed.
For more information about the dataset and instructions for download, see FiscalNote /
BillSum .
BillSum schema
In this use case, the text and summary elements are used.
Zero-shot
The goal here is to teach the GPT-3 model to learn conversation-style input. The
completions endpoint is used to create an Azure OpenAI API and a prompt that
generates the best summary of the bill. It's important to create the prompts carefully so
that they extract relevant information. To extract general summaries from a given bill,
the following format is used.
Python
openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2022-12-01-preview"
prompt_i = 'Summarize the legislative bill given the title and the
text.\n\nTitle:\n'+" ".join([normalize_text(bill_title_1)])+ '\n\nText:\n'+
" ".join([normalize_text(bill_text_1)])+'\n\nSummary:\n'
response = openai.Completion.create(
engine=TEXT_DAVINCI_001
prompt=prompt_i,
temperature=0.4,
max_tokens=500,
top_p=1.0,
frequency_penalty=0.5,
presence_penalty=0.5,
stop=['\n\n###\n\n'], # The ending token used during inference. Once it
reaches this token, GPT-3 knows the completion is over.
best_of=1
)
= 1
Ground truth: National Science Education Tax Incentive for Businesses Act of 2007 -
Amends the Internal Revenue Code to allow a general business tax credit for
contributions of property or services to elementary and secondary schools and for
teacher training to promote instruction in science, technology, engineering, or
mathematics.
Zero-shot model summary: The National Science Education Tax Incentive for Businesses
Act of 2007 would create a new tax credit for businesses that make contributions to
science, technology, engineering, and mathematics (STEM) education at the elementary
and secondary school level. The credit would be equal to 100 percent of the qualified
STEM contributions of the taxpayer for the taxable year. Qualified STEM contributions
would include STEM school contributions, STEM teacher externship expenses, and STEM
teacher training expenses.
Fine-tuning
Fine-tuning improves upon zero-shot learning by training on more examples than you
can include in the prompt, so you achieve better results on a wider number of tasks.
After a model is fine-tuned, you don't need to provide examples in the prompt. Fine-
tuning saves money by reducing the number of tokens required and enables lower-
latency requests.
For more information, see How to customize a model with Azure OpenAI Service.
This step enables you to improve upon the zero-shot model by incorporating prompt
engineering into the prompts that are used for fine-tuning. Doing so helps give
directions to the model on how to approach the prompt/completion pairs. In a fine-tune
model, prompts provide a starting point that the model can learn from and use to make
predictions. This process enables the model to start with a basic understanding of the
data, which can then be improved upon gradually as the model is exposed to more data.
Additionally, prompts can help the model to identify patterns in the data that it might
otherwise miss.
The same prompt engineering structure is also used during inference, after the model is
finished training, so that the model recognizes the behavior that it learned during
training and can generate completions as instructed.
Python
return proc_df
df_staged_full_train = stage_examples(df_prompt_completion_train)
df_staged_full_val = stage_examples(df_prompt_completion_val)
Now that the data is staged for fine-tuning in the proper format, you can start running
the fine-tune commands.
Next, you can use the OpenAI CLI to help with some of the data preparation steps. The
OpenAI tool validates data, provides suggestions, and reformats data.
Python
Python
payload = {
"model": "curie",
"training_file": " -- INSERT TRAINING FILE ID -- ",
"validation_file": "-- INSERT VALIDATION FILE ID --",
"hyperparams": {
"n_epochs": 1,
"batch_size": 200,
"learning_rate_multiplier": 0.1,
"prompt_loss_weight": 0.0001
}
}
Python
data = r.json()
print('Endpoint Called: {endpoint}'.format(endpoint = url))
print('Status Code: {status}'.format(status= r.status_code))
print('Fine tuning ID: {id}'.format(id=fine_tune_id))
print('Status: {status}'.format(status = data['status']))
print('Response Information \n\n {text}'.format(text=r.text))
Ground truth: National Science Education Tax Incentive for Businesses Act of 2007 -
Amends the Internal Revenue Code to allow a general business tax credit for
contributions of property or services to elementary and secondary schools and for
teacher training to promote instruction in science, technology, engineering, or
mathematics.
Fine-tuned model summary: This bill provides a tax credit for contributions to
elementary and secondary schools that benefit science, technology, engineering, and
mathematics education. The credit is equal to 100% of qualified STEM contributions
made by taxpayers during the taxable year. Qualified STEM contributions include: (1)
STEM school contributions, (2) STEM teacher externship expenses, and (3) STEM teacher
training expenses. The bill also provides a tax credit for contributions to elementary and
secondary schools that benefit science, technology, engineering, or mathematics
education. The credit is equal to 100% of qualified STEM service contributions made by
taxpayers during the taxable year. Qualified STEM service contributions include: (1)
STEM service contributions paid or incurred during the taxable year for services
provided in the United States or on a military base outside the United States; and (2)
STEM inventory property contributed during the taxable year which is used by an
educational organization located in the United States or on a military base outside the
United States in providing education in grades K-12 in science, technology, engineering
or mathematics.
For the results of summarizing a few more bills by using the zero-shot and fine-tune
approaches, see Results for BillSum Dataset .
Observations: Overall, the fine-tuned model does an excellent job of summarizing the
bill. It captures domain-specific jargon and the key points that are represented but not
explained in the human-written ground truth. It differentiates itself from the zero-shot
model by providing a more detailed and comprehensive summary.
Fine-tuning is not applied in the financial use case because there's not enough data
available to complete that step.
Dataset
The dataset for this use case is technical and includes key quantitative metrics to assess
a company's performance.
indexed).
completion : The ground truth summary of the report.
In this use case, Rathbone's financial report , from the dataset, will be summarized.
Rathbone's is an individual investment and wealth management company for private
clients. The report highlights Rathbone's performance in 2020 and mentions
performance metrics like profit, FUMA, and income. The key information to summarize is
on page 1 of the PDF.
Python
openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2022-12-01-preview"
name = os.path.abspath(os.path.join(os.getcwd(), '---INSERT PATH OF LOCALLY
DOWNLOADED RATHBONES_2020_PRELIM_RESULTS---')).replace('\\', '/')
pages_to_summarize = [0]
# Using pdfminer.six to extract the text
# !pip install pdfminer.six
from pdfminer.high_level import extract_text
t = extract_text(name
, page_numbers=pages_to_summarize
)
print("Text extracted from " + name)
t
Zero-shot approach
When you use the zero-shot approach, you don't provide solved examples. You provide
only the command and the unsolved input. In this example, the Instruct model is used.
This model is specifically intended to take in an instruction and record an answer for it
without extra context, which is ideal for the zero-shot approach.
After you extract the text, you can use various prompts to see how they influence the
quality of the summary:
Python
#Using the text from the Rathbone's report, you can try different prompts to
see how they affect the summary
response = openai.Completion.create(
engine="davinci-instruct",
prompt=prompt_i,
temperature=0,
max_tokens=2048-int(len(prompt_i.split())*1.5),
top_p=1.0,
frequency_penalty=0.5,
presence_penalty=0.5,
best_of=1
)
print(response.choices[0].text)
>>>
- Funds under management and administration (FUMA) reached £54.7 billion at
31 December 2020, up 8.5% from £50.4 billion at 31 December 2019
- Operating income totalled £366.1 million, 5.2% ahead of the prior year
(2019: £348.1 million)
- Underlying1 profit before tax totalled £92.5 million, an increase of 4.3%
(2019: £88.7 million); underlying operating margin of 25.3% (2019: 25.5%)
# Different prompt
response = openai.Completion.create(
engine="davinci-instruct",
prompt=prompt_i,
temperature=0,
max_tokens=2048-int(len(prompt_i.split())*1.5),
top_p=1.0,
frequency_penalty=0.5,
presence_penalty=0.5,
best_of=1
)
print(response.choices[0].text)
>>>
- Funds under management and administration (FUMA) grew by 8.5% to reach
£54.7 billion at 31 December 2020
- Underlying profit before tax increased by 4.3% to £92.5 million,
delivering an underlying operating margin of 25.3%
- The board is announcing a final 2020 dividend of 47 pence per share, which
brings the total dividend to 72 pence per share, an increase of 2.9% over
2019
Challenges
As you can see, the model might produce metrics that aren't mentioned in the
original text.
Proposed solution: You can resolve this problem by changing the prompt.
The summary might focus on one section of the article and neglect other
important information.
Proposed solution: You can try a summary of summaries approach. Divide the
report into sections and create smaller summaries that you can then summarize to
create the output summary.
Python
# Body of function
text = extract_text(name
, page_numbers=pages_to_summarize
)
r = splitter(200, text)
tok_l = int(2000/len(r))
tok_l_w = num2words(tok_l)
res_lis = []
# Stage 1: Summaries
for i in range(len(r)):
prompt_i = f'Extract and summarize the key financial numbers and
percentages mentioned in the Text in less than {tok_l_w}
words.\n\nText:\n'+normalize_text(r[i])+'\n\nSummary in one paragraph:'
response = openai.Completion.create(
engine=TEXT_DAVINCI_001,
prompt=prompt_i,
temperature=0,
max_tokens=tok_l,
top_p=1.0,
frequency_penalty=0.5,
presence_penalty=0.5,
best_of=1
)
t = trim_incomplete(response.choices[0].text)
res_lis.append(t)
response = openai.Completion.create(
engine=TEXT_DAVINCI_001,
prompt=prompt_i,
temperature=0,
max_tokens=200,
top_p=1.0,
frequency_penalty=0.5,
presence_penalty=0.5,
best_of=1
)
print(trim_incomplete(response.choices[0].text))
The input prompt includes the original text from Rathbone's financial report for a
specific year.
Ground truth: Rathbones has reported revenue of £366.1m in 2020, up from £348.1m in
2019, and an increase in underlying profit before tax to £92.5m from £88.7m. Assets
under management rose 8.5% from £50.4bn to £54.7bn, with assets in wealth
management increasing 4.4% to £44.9bn. Net inflows were £2.1bn in 2020 compared
with £600m in the previous year, driven primarily by £1.5bn inflows into its funds
business and £400m due to the transfer of assets from Barclays Wealth.
Observations: The summary of summaries approach generates a great result set that
resolves the challenges encountered initially when a more detailed and comprehensive
summary was provided. It does a great job of capturing the domain-specific jargon and
the key points, which are represented in the ground truth but not explained well.
The zero-shot model works well for summarizing mainstream documents. If the data is
industry-specific or topic-specific, contains industry-specific jargon, or requires industry-
specific knowledge, fine-tuning performs best. For example, this approach works well for
medical journals, legal forms, and financial statements. You can use the few-shot
approach instead of zero-shot to provide the model with examples of how to formulate
a summary, so it can learn to mimic the summary provided. For the zero-shot approach,
this solution doesn't retrain the model. The model's knowledge is based on the GPT-3
training. GPT-3 is trained with almost all available data from the internet. It performs
well for tasks that don't require specific knowledge.
For the results of using the zero-shot summary of summaries approach on a few reports
in the financial dataset, see Results for Summary of Summaries .
Recommendations
There are many ways to approach summarization by using GPT-3, including zero-shot,
few-shot, and fine-tuning. The approaches produce summaries of varying quality. You
can explore which approach produces the best results for your intended use case.
Based on observations on the testing presented in this article, here are few
recommendations:
Zero-shot is best for mainstream documents that don't require specific domain
knowledge. This approach attempts to capture all high-level information in a
succinct, human-like manner and provides a high-quality baseline summary. Zero-
shot creates a high-quality summary for the legal dataset that's used in the tests in
this article.
Few-shot is difficult to use for summarizing long documents because the token
limitation is exceeded when an example text is provided. You can instead use a
zero-shot summary of summaries approach for long documents or increase the
dataset to enable successful fine-tuning. The summary of summaries approach
generates excellent results for the financial dataset that's used in these tests.
Fine-tuning is most useful for technical or domain-specific use cases when the
information isn't readily available. To achieve the best results with this approach,
you need a dataset that contains a couple thousand samples. Fine-tuning captures
the summary in a few templated ways, trying to conform to how the dataset
presents the summaries. For the legal dataset, this approach generates a higher
quality of summary than the one created by the zero-shot approach.
Evaluating summarization
There are multiple techniques for evaluating the performance of summarization models.
Here's an example:
Python
Here's an example:
Python
import torchmetrics
from torchmetrics.text.bert import BERTScore
preds = "You should have ice cream in the summer"
target = "Ice creams are great when the weather is hot"
bertscore = BERTScore()
score = bertscore(preds, target)
print(score)
Similarity matrix. A similarity matrix is a representation of the similarities between
different entities in a summarization evaluation. You can use it to compare different
summaries of the same text and measure their similarity. It's represented by a two-
dimensional grid, where each cell contains a measure of the similarity between two
summaries. You can measure the similarity by using various methods, like cosine
similarity, Jaccard similarity, and edit distance. You then use the matrix to compare the
summaries and determine which one is the most accurate representation of the original
text.
Here's a sample command that gets the similarity matrix of a BERTScore comparison of
two similar sentences:
Python
The first sentence, "The cat is on the porch by the tree", is referred to as the candidate.
The second sentence is referred to as the reference. The command uses BERTScore to
compare the sentences and generate a matrix.
This following matrix displays the output that's generated by the preceding command:
For more information, see SummEval: Reevaluating Summarization Evaluation . For a
PyPI toolkit for summarization, see summ-eval 0.892 .
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Other contributors:
Next steps
Azure OpenAI - Documentation, quickstarts, API reference
What are intents in LUIS?
Conversational language understanding
Jupyter Notebook with technical details and execution of this use case
Related resources
AI architecture design
Choose a Microsoft cognitive services technology
Natural language processing technology
Build language model pipelines
with memory
Bing Web Search Azure Cache for Redis Azure Pipelines
Stay ahead of the competition by being informed and having a deep understanding of
your products and competitor products. An AI/machine learning pipeline helps you
quickly and efficiently gather, analyze, and summarize relevant information. This
architecture includes several powerful Azure OpenAI Service models. These models pair
with the popular open-source LangChain framework that's used to develop applications
that are powered by language models.
7 Note
Some parts in the introduction, components, and workflow of this article were
generated with the help of ChatGPT! Try it for yourself , or try it for your
enterprise.
Architecture
1. Internal company documents for products are imported and converted into
searchable vectors. Product-related documents are collected from departments,
such as sales, marketing, and product development. These documents are then
scanned and converted into text by using optical character recognition (OCR)
technology.
2. A LangChain chunking utility chunks the documents into smaller, more
manageable pieces. Chunking breaks down the text into meaningful phrases or
sentences that can be analyzed separately and improves the accuracy of the
pipeline's search capabilities.
3. The language model converts each chunk into a vectorized embedding.
Embeddings are a type of representation that capture the meaning and context of
the text. By converting each chunk into a vectorized embedding, you can store and
search for documents based on their meaning rather than their raw text. To
prevent loss of context within each document chunk, LangChain provides several
utilities for this text splitting step, like capabilities for sliding windows or specifying
text overlap. Some key features include utilities for tagging chunks with document
metadata, optimizing the document retrieval step, and downstream reference.
4. Create an index in a vector store database to store the raw document text,
embeddings vectors, and metadata. The resulting embeddings are stored in a
vector store database along with the raw text of the document and any relevant
metadata, such as the document's title and source.
After the batch pipeline is complete, the real-time, asynchronous pipeline searches for
relevant information. The following steps are taken:
5. Enter a query and relevant metadata, such as your role in the company or the
business unit that you work in. An embeddings model then converts your query
into a vectorized embedding.
6. The orchestrator language model decomposes your query, or main task, into the
set of subtasks that are required to answer your query. Converting the main task
into a series of simpler subtasks allows the language model to address each task
more accurately, which results in better answers with less tendency for inaccuracy.
7. The resulting embedding and decomposed subtasks are stored in the LangChain
model's memory.
a. Top internal document chunks that are relevant to your query are retrieved from
your internal database. A fast vector search is performed for the top n similar
documents that are stored as vectors in Azure Cache for Redis.
b. In parallel, a web search for similar external products is performed via the
LangChain Bing Search language model plugin with a generated search query
that the orchestrator language model composes. Results are stored in the
external model memory component.
8. The vector store database is queried and returns the top relevant product
information pages (chunks and references). The system queries the vector store
database by using your query embedding and returns the most relevant product
information pages, along with the relevant text chunks and references. The
relevant information is stored in LangChain's model memory.
9. The system uses the information that’s stored in LangChain's model memory to
create a new prompt, which is sent to the orchestrator language model to build a
summary report that’s based on your query, company internal knowledge base,
and external web results.
10. Optionally, the output from the previous step is passed to a moderation filter to
remove unwanted information. The final competitive product report is passed to
you.
Components
Azure OpenAI Service provides REST API access to OpenAI's powerful language
models, including the GPT-3, GPT-3.5, GPT-4, and embeddings model series. You
can easily adapt these models to your specific task, such as content generation,
summarization, semantic search, converting text to semantically powerful
embeddings vectors, and natural-language-to-code translation.
Scenario details
This architecture uses an AI/machine learning pipeline, LangChain, and language models
to create a comprehensive analysis of how your product compares to similar competitor
products. The pipeline consists of two main components: a batch pipeline and a real-
time, asynchronous pipeline. When you send a query to the real-time pipeline, the
orchestrator language model, often GPT-4 or the most powerful available language
model, derives a set of tasks to answer your question. These subtasks invoke other
language models and APIs to mine the internal company product database and the
public internet to build a report that shows the competitive position of your products
versus the competitor products.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Other contributor:
Related resources
AI architecture design
Batch processing
Types of language API services
Implement custom speech-to-text
Azure AI services Azure AI Speech Azure Machine Learning
This two-part guide describes various approaches for efficiently implementing high-
quality speech-aware applications. It focuses on extending and customizing the baseline
model of speech-to-text functionality that's provided by the Azure Cognitive Services
Speech service.
This article describes the problem space and decision-making process for designing
your solution. The second article, Deploy a custom speech-to-text solution, provides a
use case for applying these instructions and recommended practices.
On the left side of the spectrum, Azure Cognitive Services enables a quick and low-
friction implementation of AI capabilities into applications via pre-trained models.
Microsoft curates extensive datasets to train and build these baseline models. As a
result, you can use baseline models with no additional training data. They're consumed
via enhanced-security programmatic API calls.
Cognitive Services includes:
When the pre-built baseline models don't perform accurately enough on your data, you
can customize them by adding training data that's relative to the problem domain. This
customization requires the extra effort of gathering adequate data to train and evaluate
an acceptable model. Cognitive Services that are customizable include Custom Vision,
Custom Translator, Custom Speech, and CLU. Extending pre-built Cognitive Services
models is in the center of the spectrum. Most of this article is focused on that central
area.
Alternatively, when models and training data focus on a specific scenario and require a
proprietary training dataset, Azure Machine Learning provides custom solution
resources, tools, compute, and workflow guidance to support building entirely custom
models. This scenario appears on the right side of the spectrum. These models are built
from scratch. Developing a model by using Azure Machine Learning typically ranges
from using visual tools like AutoML to programmatically developing the model by using
notebooks.
Note, however, that the baseline model might not be sufficient if the audio contains
ambient noise or includes a lot of industry and domain-specific jargon. In these cases,
building a custom speech model makes sense. You do that by training with additional
data that's associated with the specific domain.
Depending on the size of the custom domain, it might also make sense to train multiple
models and compartmentalize a model for an individual application. For example,
Olympics commentators report on various sports, each with its own jargon. Because
each sport has a vocabulary that differs significantly from the others, building a custom
model specific to a sport increases accuracy by limiting the utterance data relative to
that particular sport. As a result, the model can learn from a precise and targeted set of
data.
The baseline model is appropriate when the audio is clear of ambient noise and
the transcribed speech consists of commonly spoken language.
A custom model augments the baseline model to include domain-specific
vocabulary that's shared across all areas of the custom domain.
Multiple custom models make sense when the custom domain has numerous
areas, each with a specific vocabulary.
Speech transcription for a specific domain, like medical transcription or call center
transcription
Live transcription, as in an app or to provide captions for live video streaming
Design considerations
This section describes some design considerations for building a speech-based
application.
Let's return to Olympics example. Say you need to include the transcription of audio
commentary for multiple sports, including ice hockey, luge, snowboarding, alpine skiing,
and more. Building a custom speech model for each sport will improve accuracy
because each sport has unique terminology. However, each model must have diverse
training data. It's too restrictive and inextensible to create a model for each
commentator for each sport. A more practical approach is to build a single model for
each sport but include audio from a group of that includes commentators with different
accents, of both genders, and of various ages. All domain-specific phrases related to the
sport as captured by the diverse commentators reside in the same model.
You also need to consider which languages and locales to support. It might make sense
to create these models by locale.
The final custom model can include datasets that use a combination of all three of the
customizations described in this section.
Train with numerous examples of phrases and utterances from the domain. For
example, include transcripts of cleaned and normalized alpine skiing event audio
and human-generated transcripts of previous events. Be sure that the transcripts
include the terms used in alpine skiing and multiple examples of how
commentators pronounce them. If you follow this process, the resulting custom
model should be able to recognize domain-specific words and phrases.
Train with specific data that focuses on problem areas. This approach works well
when there isn't much training data, for example, if new slang terms are used
during alpine skiing events and need to be included in the model. This type of
training uses the following approach:
Use Speech Studio to generate a transcription and compare it with human-
generated transcriptions.
Identify problem areas from patterns in what the commentators say. Identify:
The contexts within which the problem word or utterance is applied.
Different inflections and pronunciations of the word or utterance.
Any unique commentator-specific applications of the word or utterance.
Training a custom model with specific data can be time-consuming. Steps include
carefully analyzing the transcription gaps, manually adding training phrases, and
repeating this process multiple times. However, in the end, this approach provides
focused training for the problem areas that were previously incorrectly transcribed. And
it's possible to iteratively build this model by selectively training on critical areas and
then proceeding down the list in order of importance. Another benefit is that the
dataset size will include a few hundred utterances rather than a few thousand, even after
many iterations of building the training data.
Be aware of the difference between lexical text and display text. Speech Studio
produces WER based on lexical text. However, what the user sees is the display text
with punctuation, capitalization, and numerical words represented as numbers.
Following is an example of lexical text versus display text.
Lexical text: the speed is great and the time is even better fifty seven oh six three
seconds for the german
Display text: The speed is great. And that time is even better. 57063 seconds for
the German.
What's expected (implied) is: The speed is great. And that time is even better.
57.063 seconds for the German
The custom model has a low WER rate, but that doesn't mean that user-perceived
error rate (errors in display text) is low. This problem occurs mainly in alphanumeric
input because different applications can have alternative ways of representing the
input. You shouldn't rely only on the WER. You also need to review the final
recognition result.
When display text seems wrong, review the detailed recognition result from the
SDK, which includes lexical text, in which everything is spelled out. If the lexical text
is correct, the recognition is accurate. You can then resolve inaccuracies in the
display text (the final recognized result) by adding post-processing rules.
Manage datasets, models, and their versions. In Speech Studio, when you create
projects, datasets, and models, there are only two fields: name and description.
When you build datasets and models iteratively, you need to follow a good
naming and versioning scheme to make it easy to identify the contents of a
dataset and which model reflects which version of the dataset. For more details
about this recommendation, see Deploy a custom speech-to-text solution.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Other contributors:
Next steps
What is Custom Speech?
What is text-to-speech?
Train a Custom Speech model
Deploy a custom speech-to-text solution
Related resources
Artificial intelligence (AI) architecture design
Use a speech-to-text transcription pipeline to analyze recorded conversations
Control IoT devices with a voice assistant app
Deploy a custom speech-to-text
solution
Azure AI services Azure AI Speech Azure Machine Learning
Solution ideas
This article is a solution idea. If you'd like us to expand the content with more
information, such as potential use cases, alternative services, implementation
considerations, or pricing guidance, let us know by providing GitHub feedback .
This article is an implementation guide and example scenario that provides a sample
deployment of the solution that's described in Implement custom speech-to-text:
Architecture
Components
Azure Machine Learning is an enterprise-grade service for the end-to-end
machine learning lifecycle.
Azure Cognitive Services is a set of APIs, SDKs, and services that can help you
make your applications more intelligent, engaging, and discoverable.
Speech Studio is a set of UI-based tools for building and integrating features
from Cognitive Services Speech service into your applications. Here, it's one
alternative for training datasets. It's also used to review training results.
Speech-to-text REST API is an API that you can use to upload your own data,
test and train a custom model, compare accuracy between models, and deploy
a model to a custom endpoint. You can also use it to operationalize your model
creation, evaluation, and deployment.
Speech CLI is a command-line tool for using Speech service without having to
write any code. It provides another alternative for creating and training datasets
and for operationalizing your processes.
Scenario details
This article is based on the following fictional scenario:
Contoso, Ltd., is a broadcast media company that airs broadcasts and commentary on
Olympics events. As part of the broadcast agreement, Contoso provides event
transcription for accessibility and data mining.
Contoso wants to use the Azure Speech service to provide live subtitling and audio
transcription for Olympics events. Contoso employs female and male commentators
from around the world who speak with diverse accents. In addition, each individual sport
has specific terminology that can make transcription difficult. This article describes the
application development process for this scenario: providing subtitles for an application
that needs to deliver accurate event transcription.
1. Use Speech Studio, Azure Speech SDK, Speech CLI, or the REST API to generate
transcripts for spoken sentences and utterances.
2. Compare the generated transcript with the human-generated transcript.
3. If certain domain-specific words are transcribed incorrectly, consider creating a
custom speech model for that specific domain.
4. Review various options for creating custom models. Decide whether one or many
custom models will work better.
5. Collect training and testing data.
6. Ensure the data is in an acceptable format.
7. Train, test and evaluate, and deploy the model.
8. Use the custom model for transcription.
9. Operationalize the model building, evaluation, and deployment process.
1. Use Speech Studio, Azure Speech SDK, Speech CLI, or the REST API to generate
transcripts for spoken sentences and utterances
Azure Speech provides SDKs, a CLI interface, and a REST API for generating transcripts
from audio files or directly from microphone input. If the content is in an audio file, it
needs to be in a supported format. In this scenario, Contoso has previous event
recordings (audio and video) in .avi files. Contoso can use tools like FFmpeg to extract
audio from the video files and save it in a format that's supported by the Azure Speech
SDK, like .wav.
In the following code, the standard PCM audio codec, pcm_s16le , is used to extract
audio in a single channel (mono) that has a sampling rate of 8 KHz.
To perform the comparison, Contoso samples commentary audio from multiple sports
and uses Speech Studio to compare the human-generated transcript with the results
transcribed by Azure Speech service. The Contoso human-generated transcripts are in a
WebVTT format. To use these transcripts, Contoso cleans them up and generates a
simple .txt file that has normalized text without the timestamp information.
For information about using Speech Studio to create and evaluate a dataset, see
Training and testing datasets.
For more information about WER, see Evaluate word error rate.
Based on these results, the custom model (Olympics_Skiing_v6) is better than the base
model (20211030) for the dataset.
Note the Insertion and Deletion rates, which indicate that the audio file is relatively
clean and has low background noise.
Based on the results in the preceding table, for the base model, Model 1: 20211030,
about 10 percent of the words are substituted. In Speech Studio, use the detailed
comparison feature to identify domain-specific words that are missed. The following
table shows one section of the comparison.
she has dethroned the she has dethroned the she has dethroned the
olympic champion goggia olympic champion georgia olympic champion goggia
Model 1 doesn't recognize domain-specific words like the names of the athletes "Katia
Seizinger" and "Goggia." However, when the custom model is trained with data that
includes the athletes' names and other domain-specific words and phrases, it's able to
learn and recognize them.
4. Review various options for creating custom models. Decide whether one or many
custom models will work better
By experimenting with various ways to build custom models, Contoso found that they
could achieve better accuracy by using language and pronunciation model
customization. (See the first article in this guide.) Contoso also noted minor
improvements when they included acoustic (original audio) data for building the custom
model. However, the benefits weren't significant enough to make it worth maintaining
and training for a custom acoustic model.
Contoso found that creating separate custom language models for each sport (one
model for alpine skiing, one model for luge, one model for snowboarding, and so on)
provided better recognition results. They also noted that creating separate acoustic
models based on the type of sport to augment the language models wasn't necessary.
The Training and testing datasets article provides details about collecting the data
needed for training a custom model. Contoso collected transcripts for various Olympics
sports from diverse commentators and used language model adaptation to build one
model per sport type. However, they used one pronunciation file for all custom models
(one for each sport). Because the testing and training data are kept separate, after a
custom model was built, Contoso used event audio whose transcripts weren't included
in the training dataset for model evaluation.
As described in Training and testing datasets, datasets that are used to create a custom
model or to test the model need to be in a specific format. Contoso's data is in WebVTT
files. They created some simple tools to produce text files that contain normalized text
for language model adaptation.
New event recordings are used to further test and evaluate the trained model. It can
take a couple of iterations of testing and evaluation to fine-tune a model. Finally, when
the model generates transcripts that have acceptable error rates, it's deployed
(published) to be consumed from the SDK.
After the custom model is deployed, you can use the following C# code to use the
model in the SDK for transcription:
C#
region. You can get these values from the Azure portal by going to the resource
group where the Cognitive Services resource was created and looking at its keys.
After the custom model is published, it needs to be evaluated regularly and updated if
new vocabulary is added. Your business might evolve, and you might need more custom
models to increase coverage for more domains. The Azure Speech team also releases
new base models, which are trained on more data, as they become available.
Automation can help you keep up with these changes. The next section of this article
provides more details about automating the preceding steps.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Other contributors:
Next steps
What is Custom Speech?
What is text-to-speech?
Train a Custom Speech model
Implement custom speech-to-text
Azure/custom-speech-stt on GitHub
Related resources
Artificial intelligence (AI) architecture design
Use a speech-to-text transcription pipeline to analyze recorded conversations
Control IoT devices with a voice assistant app
Implement custom speech-to-text
Conversation summarization
Azure AI services
Most businesses provide customer service support to help customers with product
queries, troubleshooting, and maintaining or upgrading features or the product itself. To
provide a satisfactory resolution, customer support specialists need to respond quickly
with accurate information. OpenAI can help organizations with customer support in a
variety of ways.
Conversation scenarios
Self-service chatbots (fully automated). In this scenario, customers can interact
with a chatbot that's powered by GPT-3 and trained on industry-specific data. The
chatbot can understand customer questions and answer appropriately based on
responses learned from a knowledge base.
Chatbot with agent intervention (semi-automated). Questions posed by
customers are sometimes complex and necessitate human intervention. In such
cases, GPT-3 can provide a summary of the customer-chatbot conversation and
help the agent with quick searches for additional information from a large
knowledge base.
Summarizing transcripts (fully automated or semi-automated). In most customer
support centers, agents are required to summarize conversations for record
keeping, future follow-up, training, and other internal processes. GPT-3 can
provide automated or semi-automated summaries that capture salient details of
conversations for further use.
This guide focuses on the process for summarizing transcripts by using Azure OpenAI
GPT-3.
Architecture
A typical architecture for a conversation summarizer has three main stages: pre-
processing, summarization, and post-processing. If the input contains a verbal
conversation or any form of speech, the speech needs to be transcribed to text. For
more information, see Azure Speech-to-text service .
Workflow
1. Gather input data: Feed relevant input data into the pipeline. If the source is an
audio file, you need to convert it to text by using a TTS service like Azure text-to-
speech.
2. Pre-process the data: Remove confidential information and any unimportant
conversation from the data.
3. Feed the data into the summarizer: Pass the data in a prompt via Azure OpenAI
APIs. In-context learning models include zero-shot, few-shot, or a custom model.
4. Generate a summary: The model generates a summary of the conversation.
5. Post-process the data: Apply a profanity filter and various validation checks to the
summary. Add sensitive or confidential data that was removed during the pre-
process step back into the summary.
6. Evaluate the results: Review and evaluate the results. This step can help you
identify areas where the model needs to be improved and find errors.
The following sections provide more details about the three main stages.
Pre-process
The goal of pre-processing is to ensure that the data provided to the summarizer service
is relevant and doesn't include sensitive or confidential information.
Here are some pre-processing steps that can help condition your raw data. You might
need to apply one or many steps, depending on the use case.
Remove personally identifiable information (PII). You can use the Conversational
PII API (preview) to remove PII from transcribed or written text. This example shows
the output after the API has removed PII:
Summarizer
OpenAI's text-completion API endpoint is called the completions endpoint. To start the
text-completion process, it requires a prompt. Prompt engineering is a process used in
large language models. The first part of the prompt includes natural language
instructions and/or examples of the specific task requested (in this scenario,
summarization). Prompts allow developers to provide some context to the API, which
can help it generate more relevant and accurate text completions. The model then
completes the task by predicting the most probable next text. This technique is known
as in-context learning.
7 Note
There are three main approaches for training models for in-context learning: zero-shot,
few-shot and fine-tuning. These approaches vary based on the amount of task-specific
data that's provided to the model.
Zero-shot: In this approach, no examples are provided to the model. The task
request is the only input. In zero-shot learning, the model relies on data that GPT-3
is already trained on (almost all available data from the internet). It attempts to
relate the given task to existing categories that it has already learned about and
responds accordingly.
Few-shot: When you use this approach, you include a small number of examples in
the prompt that demonstrate the expected answer format and the context. The
model is provided with a very small amount of training data, typically just a few
examples, to guide its predictions. Training with a small set of examples enables
the model to generalize and understand related but previously unseen tasks.
Creating these few-shot examples can be challenging because they need to clarify
the task you want the model to perform. One commonly observed problem is that
models, especially small ones, are sensitive to the writing style that's used in the
training examples.
The main advantages of this approach are a significant reduction in the need for
task-specific data and reduced potential to learn an excessively narrow distribution
from a large but narrow fine-tuning dataset.
With this approach, you can't update the weights of the pretrained model.
You can use this customization step to improve your process by:
Including a larger set of example data.
Using traditional optimization techniques with backpropagation to readjust the
weights of the model. These techniques enable higher quality results than the
zero-shot or few-shot approaches provide by themselves.
Improving the few-shot learning approach by training the model weights with
specific prompts and a specific structure. This technique enables you to achieve
better results on a wider number of tasks without needing to provide examples
in the prompt. The result is less text sent and fewer tokens.
Disadvantages include the need for a large new dataset for every task, the
potential for poor generalization out of distribution, and the possibility to exploit
spurious features of the training data, resulting in high chances of unfair
comparison with human performance.
Creating a dataset for model customization is different from designing prompts for
use with the other models. Prompts for completion calls often use either detailed
instructions or few-shot learning techniques and consist of multiple examples. For
fine-tuning, we recommend that each training example consists of a single input
example and its desired output. You don't need to provide detailed instructions or
examples in the prompt.
Post-process
We recommend that you check the validity of the results that you get from GPT-3.
Implement validity checks by using a programmatic approach or classifiers, depending
on the use case. Here are some critical checks:
Finally, reintroduce any vital information that was previously removed from the
summary, like confidential information.
In some cases, a summary of the conversation is also sent to the customer, along with
the original transcript. In these cases, post-processing involves appending the transcript
to the summary. It can also include adding lead-in sentences like "Please see the
summary below."
Considerations
It's important to fine-tune your base models with an industry-specific training dataset
and change the size of available datasets. Fine-tuned models perform best when the
training data includes at least 1,000 data points and the ground truth (human-generated
summaries) used to train the models is of high quality.
The tradeoff is cost. The process of labeling and cleaning datasets can be expensive. To
ensure high-quality training data, you might need to manually inspect ground truth
summaries and rewrite low-quality summaries. Consider the following points about the
summarization stage:
Prompt engineering: When provided with little instruction, Davinci often performs
better than other models. To optimize results, experiment with different prompts
for different models.
Token size: A summarizer that's based on GPT-3 is limited to a total of 4,098
tokens, including the prompt and completion. To summarize larger passages,
separate the text into parts that conform to these constraints. Summarize each part
individually and then collect the results in a final summary.
Garbage in, garbage out: Trained models are only as good as the training data that
you provide. Be sure that the ground truth summaries in the training data are well
suited to the information that you eventually want to summarize in your dialogs.
Stopping point: The model stops summarizing when it reaches a natural stopping
point or a stop sequence that you provide. Test this parameter to choose among
multiple summaries and to check whether summaries look incomplete.
Example scenario: Summarizing transcripts in
call centers
This scenario demonstrates how the Azure OpenAI summarization feature can help
customer service agents with summarization tasks. It tests the zero-shot, few-shot, and
fine-tuning approaches and compares the results against human-generated summaries.
Prompt Completion
Customer: Sweet.
Ideal output. The goal is to create summaries that follow this format: "Customer said x.
Agent responded y." Another goal is to capture salient features of the dialog, like the
customer complaint, suggested resolution, and follow-up actions.
Here's an example of a customer support interaction, followed by a comprehensive
human-written summary of it:
Dialog
Agent. I see that you need help with the Xbox Game Pass.
Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.
Agent: Once a game leaves the Xbox Game Pass catalog, you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off (or the best available
discounted price) to continue playing a game once it leaves the catalog.
Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.
Zero-shot
The zero-shot approach is useful when you don't have ample labeled training data. In
this case, there aren't enough ground truth summaries. It's important to design prompts
carefully to extract relevant information. The following format is used to extract general
summaries from customer-agent chats:
Python
rouge = Rouge()
# Run zero-shot prediction for all engines of interest
deploymentNames = ["curie-instruct","davinci-instruct"] # also known as
text-davinci/text-instruct
for deployment in deploymentNames:
url = openai.api_base + "openai/deployments/" + deployment + "/completions?
api-version=2022-12-01-preivew"
response_list = []
rouge_list = []
print("calling…" + deployment)
for i in range(len(test)):
response_i = openai.Completion.create(
engine = deployment,
prompt = build_prompt(prefix, [test['prompt'][i]], suffix),
temperature = 0.0,
max_tokens = 400,
top_p = 1.0,
frequence_penalty = 0.5,
persence_penalty = 0.0,
stop=["end"] # We recommend that you adjust the stop sequence based on the
dataset
)
scores = rouge.get_scores(normalize_text(response_i[‘choices’][ 0]
['text']),test['completion'][i])
rouge_list += [scores[0]['rouge-1']['f']],
response_list += [response_i]
summary_list = [normalize_text(i[‘choices’][0][‘text’]) for i in
response_list]
test[deployment + "_zeroshotsummary"] = summary_list
test[deployment + "_zeroshotroguescore"] = rouge_list
Dialog
Agent: I see that you need help with the Xbox Game Pass.
Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.
Agent: Once a game leaves the Xbox game pass catalog you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off or the best available
discounted price to continue playing a game once it leaves the catalog.
Ground truth
Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.
Davinci result
The customer asked how long they could access games after they leave the Xbox game
pass catalog. The agent told them that they would need to purchase a digital copy of
the game from the Xbox app for windows or the Microsoft store in order to continue
playing it. The agent also reminded the customer that they would be notified prior to a
game leaving the Xbox game pass catalog. as a member the customer could purchase
any game in the catalog for up to 20 off or the best available discounted price.
Curie result
The customer asks the agent about the Xbox game pass. the agent tells the customer
that once a game leaves the Xbox game pass catalog the customer will need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game. The agent
also reminds the customer that Xbox will notify members prior to a game leaving the
Xbox game pass catalog.
Few-shot
When you use the few-shot approach, the model is provided with a small number of
examples.
summaries:"
Python
train_small = train[]
train_small_json = train_small.to_dict(orient='records')
compiled_train_prompt = build_prompt_fewshot(prefix,context_primer,
train_small_json, suffix)
Dialog
Agent: I see that you need help with the Xbox Game Pass.
Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.
Agent: Once a game leaves the Xbox Game Pass catalog, you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off or the best available
discounted price to continue playing a game once it leaves the catalog.
Ground truth
Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.
Davinci result
customer wanted to know how long they could access games after they leave game
pass. Agent informs that once a game leaves the Xbox game pass catalog the customer
would need to purchase a digital copy or obtain another form of entitlement to
continue playing the game.
Curie result
customer has a question about the game pass. customer is good. agent needs help with
the Xbox game pass. customer asks how long they can access the games after they
leave the game pass catalog. Agent informs that once a game leaves the Xbox game
pass catalog the customer will need to purchase a digital copy from the Xbox app for
windows or the Microsoft store play from a disc or obtain another form of entitlement
to continue playing the game. customer is happy to hear this and thanks agent.
Fine-tuning
Fine-tuning is the process of tailoring models to get a specific desired outcome from
your own datasets.
Dialog
Agent: I see that you need help with the Xbox Game Pass.
Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.
Agent: Once a game leaves the Xbox Game Pass catalog, you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off or the best available
discounted price to continue playing a game once it leaves the catalog.
Ground truth
Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.
Curie result
customer wants to know how long they can access the games after they leave game
pass. agent explains that once a game leaves the Xbox game pass catalog they'll need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game.
Conclusions
Generally, the Davinci model requires fewer instructions to perform tasks than other
models, such as Curie. Davinci is better suited for summarizing text that requires an
understanding of context or specific language. Because Davinci is the most complex
model, its latency is higher than that of other models. Curie is faster than Davinci and is
capable of summarizing conversations.
These tests suggest that you can generate better summaries when you provide more
instruction to the model via few-shot or fine-tuning. Fine-tuned models are better at
conforming to the structure and context learned from a training dataset. This capability
is especially useful when summaries are domain specific (for example, generating
summaries from a doctor's notes or online-prescription customer support). If you use
fine-tuning, you have more control over the types of summaries that you see.
For the sake of easy comparison, here's a summary of the results that are presented
earlier:
Ground truth
Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.
The customer asked how long they could access games after they leave the Xbox game
pass catalog. The agent told them that they would need to purchase a digital copy of
the game from the Xbox app for windows or the Microsoft store in order to continue
playing it. The agent also reminded the customer that they would be notified prior to a
game leaving the Xbox game pass catalog. As a member the customer could purchase
any game in the catalog for up to 20 off or the best available discounted price.
The customer asks the agent about the Xbox game pass. the agent tells the customer
that once a game leaves the Xbox game pass catalog the customer will need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game. The agent
also reminds the customer that Xbox will notify members prior to a game leaving the
Xbox game pass catalog.
customer has a question about the game pass. customer is good. agent needs help with
the Xbox game pass. customer asks how long they can access the games after they
leave the game pass catalog. Agent informs that once a game leaves the Xbox game
pass catalog the customer will need to purchase a digital copy from the Xbox app for
windows or the Microsoft store play from a disc or obtain another form of entitlement
to continue playing the game. customer is happy to hear this and thanks agent.
customer wants to know how long they can access the games after they leave game
pass. agent explains that once a game leaves the Xbox game pass catalog they'll need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game.
Evaluating summarization
There are multiple techniques for evaluating the performance of summarization models.
Here's an example:
Python
Here's an example:
Python
import torchmetrics
from torchmetrics.text.bert import BERTScore
preds = "You should have ice cream in the summer"
target = "Ice creams are great when the weather is hot"
bertscore = BERTScore()
score = bertscore(preds, target)
print(score)
Python
The first sentence, "The cat is on the porch by the tree," is referred to as the candidate.
The second sentence is referred as the reference. The command uses BERTScore to
compare the sentences and generate a matrix.
This following matrix displays the output that's generated by the preceding command:
Responsible use
GPT can produce excellent results, but you need to check the output for social, ethical,
and legal biases and harmful results. When you fine-tune models, you need to remove
any data points that might be harmful for the model to learn. You can use red teaming
to identify any harmful outputs from the model. You can implement this process
manually and support it by using semi-automated methods. You can generate test cases
by using language models and then use a classifier to detect harmful behavior in the
test cases. Finally, you should perform a manual check of generated summaries to
ensure that they're ready to be used.
For more information, see Red Teaming Language Models with Language Models .
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Other contributor:
Next steps
More information about Azure OpenAI
ROUGE reference article
Training module: Introduction to Azure OpenAI Service
Learning path: Develop AI solutions with Azure OpenAI
Related resources
Query-based document summarization
Choose a Microsoft cognitive services technology
Natural language processing technology
Machine learning operations (MLOps)
v2
Article • 08/31/2023
This article describes three Azure architectures for machine learning operations. They all
have end-to-end continuous integration (CI), continuous delivery (CD), and retraining
pipelines. The architectures are for these AI applications:
The architectures are the product of the MLOps v2 project. They incorporate the best
practices that the solution architects discovered in the process of creating multiple
machine learning solutions. The result is deployable, repeatable, and maintainable
patterns as described here.
For an implementation with sample deployment templates for MLOps v2, see Azure
MLOps (v2) solution accelerator on GitHub.
Simulations, deep reinforcement learning, and other forms of AI aren't covered by this
article.
Architecture
The MLOps v2 architectural pattern is made up of four main modular elements that
represent these phases of the MLOps lifecycle:
Data estate
Administration and setup
Model development (inner loop)
Model deployment (outer loop)
These elements, the relationships between them, and the personas typically associated
with them are common for all MLOps v2 scenario architectures. There can be variations
in the details of each, depending on the scenario.
The base architecture for MLOps v2 for Machine Learning is the classical machine
learning scenario on tabular data. The CV and NLP architectures build on and modify
this base architecture.
Current architectures
The architectures currently covered by MLOps v2 and discussed in this article are:
1. Data estate
This element illustrates the data estate of the organization, and potential data
sources and targets for a data science project. Data engineers are the primary
owners of this element of the MLOps v2 lifecycle. The Azure data platforms in this
diagram are neither exhaustive nor prescriptive. The data sources and targets that
represent recommended best practices based on the customer use case are
indicated by a green check mark.
This element is the first step in the MLOps v2 accelerator deployment. It consists of
all tasks related to creation and management of resources and roles associated
with the project. These can include the following tasks, and perhaps others:
a. Creation of project source code repositories
b. Creation of Machine Learning workspaces by using Bicep, ARM, or Terraform
c. Creation or modification of datasets and compute resources that are used for
model development and deployment
d. Definition of project team users, their roles, and access controls to other
resources
e. Creation of CI/CD pipelines
f. Creation of monitors for collection and notification of model and infrastructure
metrics
The primary persona associated with this phase is the infrastructure team, but
there can also be data engineers, machine learning engineers, and data scientists.
3. Model development (inner loop)
The inner loop element consists of your iterative data science workflow that acts
within a dedicated, secure Machine Learning workspace. A typical workflow is
illustrated in the diagram. It proceeds from data ingestion, exploratory data
analysis, experimentation, model development and evaluation, to registration of a
candidate model for production. This modular element as implemented in the
MLOps v2 accelerator is agnostic and adaptable to the process your data science
team uses to develop models.
Personas associated with this phase include data scientists and machine learning
engineers.
After the data science team develops a model that's a candidate for deploying to
production, the model can be registered in the Machine Learning workspace
registry. CI pipelines that are triggered, either automatically by model registration
or by gated human-in-the-loop approval, promote the model and any other model
dependencies to the model deployment phase.
Personas associated with this stage are typically machine learning engineers.
Personas associated with this phase are primarily machine learning engineers.
The staging and test phase can vary with customer practices but typically includes
operations such as retraining and testing of the model candidate on production
data, test deployments for endpoint performance, data quality checks, unit testing,
and responsible AI checks for model and data bias. This phase takes place in one
or more dedicated, secure Machine Learning workspaces.
7. Production deployment
After a model passes the staging and test phase, it can be promoted to production
by using a human-in-the-loop gated approval. Model deployment options include
a managed batch endpoint for batch scenarios or, for online, near-real-time
scenarios, either a managed online endpoint or Kubernetes deployment by using
Azure Arc. Production typically takes place in one or more dedicated, secure
Machine Learning workspaces.
8. Monitoring
Monitoring in staging, test, and production makes it possible for you to collect
metrics for, and act on, changes in performance of the model, data, and
infrastructure. Model and data monitoring can include checking for model and
data drift, model performance on new data, and responsible AI issues.
Infrastructure monitoring can watch for slow endpoint response, inadequate
compute capacity, or network problems.
Based on criteria for model and data matters of concern such as metric thresholds
or schedules, automated triggers and notifications can implement appropriate
actions to take. This can be regularly scheduled automated retraining of the model
on newer production data and a loopback to staging and test for pre-production
evaluation. Or, it can be due to triggers on model or data issues that require a
loopback to the model development phase where data scientists can investigate
and potentially develop a new model.
1. Data estate
This element illustrates the data estate of the organization and potential data
sources and targets for a data science project. Data engineers are the primary
owners of this element of the MLOps v2 lifecycle. The Azure data platforms in this
diagram are neither exhaustive nor prescriptive. Images for CV scenarios can come
from many different data sources. For efficiency when developing and deploying
CV models with Machine Learning, recommended Azure data sources for images
are Azure Blob Storage and Azure Data Lake Storage.
This element is the first step in the MLOps v2 accelerator deployment. It consists of
all tasks related to creation and management of resources and roles associated
with the project. For CV scenarios, administration and setup of the MLOps v2
environment is largely the same as for classical machine learning, but with an
additional step: create image labeling and annotation projects by using the
labeling feature of Machine Learning or another tool.
The inner loop element consists of your iterative data science workflow performed
within a dedicated, secure Machine Learning workspace. The primary difference
between this workflow and the classical machine learning scenario is that image
labeling and annotation is a key element of this development loop.
After the data science team develops a model that's a candidate for deploying to
production, the model can be registered in the Machine Learning workspace
registry. CI pipelines that are triggered either automatically by model registration
or by gated human-in-the-loop approval promote the model and any other model
dependencies to the model deployment phase.
The staging and test phase can vary with customer practices but typically includes
operations such as test deployments for endpoint performance, data quality
checks, unit testing, and responsible AI checks for model and data bias. For CV
scenarios, retraining of the model candidate on production data can be omitted
due to resource and time constraints. Instead, the data science team can use
production data for model development, and the candidate model that's
registered from the development loop is the model that's evaluated for
production. This phase takes place in one or more dedicated, secure Machine
Learning workspaces.
7. Production deployment
After a model passes the staging and test phase, it can be promoted to production
via human-in-the-loop gated approvals. Model deployment options include a
managed batch endpoint for batch scenarios or, for online, near-real-time
scenarios, either a managed online endpoint or Kubernetes deployment by using
Azure Arc. Production typically takes place in one or more dedicated, secure
Machine Learning workspaces.
8. Monitoring
Monitoring in staging, test, and production makes it possible for you to collect
metrics for, and act on, changes in the performance of the model, data, and
infrastructure. Model and data monitoring can include checking for model
performance on new images. Infrastructure monitoring can watch for slow
endpoint response, inadequate compute capacity, or network problems.
The data and model monitoring and event and action phases of MLOps for NLP
are the key differences from classical machine learning. Automated retraining is
typically not done in CV scenarios when model performance degradation on new
images is detected. In this case, new images for which the model performs poorly
must be reviewed and annotated by a human-in-the-loop process, and often the
next action goes back to the model development loop for updating the model
with the new images.
1. Data estate
This element illustrates the organization data estate and potential data sources
and targets for a data science project. Data engineers are the primary owners of
this element of the MLOps v2 lifecycle. The Azure data platforms in this diagram
are neither exhaustive nor prescriptive. Data sources and targets that represent
recommended best practices based on the customer use case are indicated by a
green check mark.
This element is the first step in the MLOps v2 accelerator deployment. It consists of
all tasks related to creation and management of resources and roles associated
with the project. For NLP scenarios, administration and setup of the MLOps v2
environment is largely the same as for classical machine learning, but with an
additional step: create image labeling and annotation projects by using the
labeling feature of Machine Learning or another tool.
The inner loop element consists of your iterative data science workflow performed
within a dedicated, secure Machine Learning workspace. The typical NLP model
development loop can be significantly different from the classical machine learning
scenario in that annotators for sentences and tokenization, normalization, and
embeddings for text data are the typical development steps for this scenario.
After the data science team develops a model that's a candidate for deploying to
production, the model can be registered in the Machine Learning workspace
registry. CI pipelines that are triggered either automatically by model registration
or by gated human-in-the-loop approval promote the model and any other model
dependencies to the model deployment phase.
The staging and test phase can vary with customer practices, but typically includes
operations such as retraining and testing of the model candidate on production
data, test deployments for endpoint performance, data quality checks, unit testing,
and responsible AI checks for model and data bias. This phase takes place in one
or more dedicated, secure Machine Learning workspaces.
7. Production deployment
After a model passes the staging and test phase, it can be promoted to production
by a human-in-the-loop gated approval. Model deployment options include a
managed batch endpoint for batch scenarios or, for online, near-real-time
scenarios, either a managed online endpoint or Kubernetes deployment by using
Azure Arc. Production typically takes place in one or more dedicated, secure
Machine Learning workspaces.
8. Monitoring
Monitoring in staging, test, and production makes it possible for you to collect and
act on changes in performance of the model, data, and infrastructure. Model and
data monitoring can include checking for model and data drift, model
performance on new text data, and responsible AI issues. Infrastructure monitoring
can watch for issues such as slow endpoint response, inadequate compute
capacity, and network problems.
As with the CV architecture, the data and model monitoring and event and action
phases of MLOps for NLP are the key differences from classical machine learning.
Automated retraining isn't typically done in NLP scenarios when model
performance degradation on new text is detected. In this case, new text data for
which the model performs poorly must be reviewed and annotated by a human-in-
the-loop process. Often the next action is to go back to the model development
loop to update the model with the new text data.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Next steps
What is Azure Pipelines?
Azure Arc overview
What is Azure Machine Learning?
Data in Azure Machine Learning
Azure MLOps (v2) solution accelerator
End-to-end machine learning operations (MLOps) with Azure Machine Learning
Introduction to Azure Data Lake Storage Gen2
Azure DevOps documentation
GitHub Docs
Azure Synapse Analytics documentation
Azure Event Hubs documentation
Related resources
Choose a Microsoft cognitive services technology
Natural language processing technology
Compare the machine learning products and technologies from Microsoft
How Azure Machine Learning works: resources and assets (v2)
What are Azure Machine Learning pipelines?
Machine learning operations (MLOps) framework to upscale machine learning
lifecycle with Azure Machine Learning
What is the Team Data Science Process?
MLOps for Python models using
Azure Machine Learning
Azure Blob Storage Azure Container Registry Azure DevOps Azure Machine Learning Azure Pipelines
Architecture
Azure Pipelines. This build and test system is based on Azure DevOps and used for the
build and release pipelines. Azure Pipelines breaks these pipelines into logical steps
called tasks. For example, the Azure CLI task makes it easier to work with Azure
resources.
Azure Machine Learning is a cloud service for training, scoring, deploying, and
managing machine learning models at scale. This architecture uses the Azure Machine
Learning Python SDK to create a workspace, compute resources, the machine learning
pipeline, and the scoring image. An Azure Machine Learning workspace provides the
space in which to experiment, train, and deploy machine learning models.
Azure Machine Learning pipelines provide reusable machine learning workflows that
can be reused across scenarios. Training, model evaluation, model registration, and
image creation occur in distinct steps within these pipelines for this use case. The
pipeline is published or updated at the end of the build phase and gets triggered on
new data arrival.
Azure Blob Storage. Blob containers are used to store the logs from the scoring service.
In this case, both the input data and the model prediction are collected. After some
transformation, these logs can be used for model retraining.
Azure Container Registry. The scoring Python script is packaged as a Docker image and
versioned in the registry.
Azure Container Instances. As part of the release pipeline, the QA and staging
environment is mimicked by deploying the scoring webservice image to Container
Instances, which provides an easy, serverless way to run a container.
Azure Kubernetes Service. Once the scoring webservice image is thoroughly tested in
the QA environment, it is deployed to the production environment on a managed
Kubernetes cluster.
Build pipeline
The CI pipeline gets triggered every time code is checked in. It publishes an updated
Azure Machine Learning pipeline after building the code and running a suite of tests.
The build pipeline consists of the following tasks:
Code quality. These tests ensure that the code conforms to the standards of the
team.
Unit test. These tests make sure the code works, has adequate code coverage, and
is stable.
Data test. These tests verify that the data samples conform to the expected
schema and distribution. Customize this test for other use cases and run it as a
separate data sanity pipeline that gets triggered as new data arrives. For example,
move the data test task to a data ingestion pipeline so you can test it earlier.
7 Note
You should consider enabling DevOps practices for the data used to train the
machine learning models, but this is not covered in this article. For more
information about the architecture and best practices for CI/CD of a data ingestion
pipeline, see DevOps for a data ingestion pipeline.
The following one-time tasks occur when setting up the infrastructure for Azure
Machine Learning and the Python SDK:
Create the workspace that hosts all Azure Machine Learning-related resources.
Create the compute resources that run the training job.
Create the machine learning pipeline with the updated training script.
Publish the machine learning pipeline as a REST endpoint to orchestrate the
training workflow. The next section describes this step.
Retraining pipeline
The machine learning pipeline orchestrates the process of retraining the model in an
asynchronous manner. Retraining can be triggered on a schedule or when new data
becomes available by calling the published pipeline REST endpoint from the previous
step.
Train model. The training Python script is executed on the Azure Machine Learning
Compute resource to get a new model file which is stored in the run history. Since
training is the most compute-intensive task in an AI project, the solution uses
Azure Machine Learning Compute.
Evaluate model. A simple evaluation test compares the new model with the
existing model. Only when the new model is better does it get promoted.
Otherwise, the model is not registered and the pipeline is canceled.
Register model. The retrained model is registered with the Azure ML Model
registry. This service provides version control for the models along with metadata
tags so they can be easily reproduced.
Release pipeline
This pipeline shows how to operationalize the scoring image and promote it safely
across different environments. This pipeline is subdivided into two environments, QA
and production:
QA environment
Model Artifact trigger. Release pipelines get triggered every time a new artifact is
available. A new model registered to Azure Machine Learning Model Management
is treated as a release artifact. In this case, a pipeline is triggered for each new
model is registered.
Create a scoring image. The registered model is packaged together with a scoring
script and Python dependencies (Conda YAML file ) into an operationalization
Docker image. The image automatically gets versioned through Azure Container
Registry.
Test web service. A simple API test makes sure the image is successfully deployed.
Production environment
Deploy on Azure Kubernetes Service. This service is used for deploying a scoring
image as a web service at scale in a production environment.
Test web service. A simple API test makes sure the image is successfully deployed.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Scalability
A build pipeline on Azure DevOps can be scaled for applications of any size. Build
pipelines have a maximum timeout that varies depending on the agent they are run on.
Builds can run forever on self-hosted agents (private agents). For Microsoft-hosted
agents for a public project, builds can run for six hours. For private projects, the limit is
30 minutes.
To use the maximum timeout, set the following property in your Azure Pipelines YAML
file:
YAML
jobs:
- job: <job_name>
timeoutInMinutes: 0
Ideally, have your build pipeline finish quickly and execute only unit tests and a subset
of other tests. This allows you to validate the changes quickly and fix them if issues arise.
Run long-running tests during off-hours.
The release pipeline publishes a real-time scoring web service. A release to the QA
environment is done using Container Instances for convenience, but you can use
another Kubernetes cluster running in the QA/staging environment.
Scale the production environment according to the size of your Azure Kubernetes
Service cluster. The size of the cluster depends on the load you expect for the deployed
scoring web service. For real-time scoring architectures, throughput is a key
optimization metric. For non-deep learning scenarios, the CPU should be sufficient to
handle the load; however, for deep learning workloads, when speed is a bottleneck,
GPUs generally provide better performance compared to CPUs. Azure Kubernetes
Service supports both CPU and GPU node types, which is the reason this solution uses it
for image deployment. For more information, see GPUs vs CPUs for deployment of deep
learning models.
Scale the retraining pipeline up and down depending on the number of nodes in your
Azure Machine Learning Compute resource, and use the autoscaling option to manage
the cluster. This architecture uses CPUs. For deep learning workloads, GPUs are a better
choice and are supported by Azure Machine Learning Compute.
Management
Monitor retraining job. Machine learning pipelines orchestrate retraining across a
cluster of machines and provide an easy way to monitor them. Use the Azure
Machine Learning UI and look under the pipelines section for the logs.
Alternatively, these logs are also written to blob and can be read from there as well
using tools such as Azure Storage Explorer .
Logging. Azure Machine Learning provides an easy way to log at each step of the
machine learning life cycle. The logs are stored in a blob container. For more
information, see Enable logging in Azure Machine Learning. For richer monitoring,
configure Application Insights to use the logs.
Security. All secrets and credentials are stored in Azure Key Vault and accessed in
Azure Pipelines using variable groups.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
Azure DevOps is free for open-source projects and small projects with up to five
users. For larger teams, purchase a plan based on the number of users.
Compute is the biggest cost driver in this architecture and its cost varies depending on
the use case. This architecture uses Azure Machine Learning Compute, but other options
are available. Azure Machine Learning does not add any surcharge on top of the cost of
the virtual machines backing your compute cluster. Configure your compute cluster to
have a minimum of 0 nodes, so that when not in use, it can scale down to 0 nodes and
not incur any costs. The compute cost depends on the node type, a number of nodes,
and provisioning mode (low-priority or dedicated). You can estimate the cost for
Machine Learning and other services using the Azure pricing calculator .
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Want to learn more? Check out the related learning path, Start the machine
learning lifecycle with MLOps.
Secure MLOps solutions with
Azure network security
Azure DevOps Azure DNS Azure Machine Learning Azure Private Link Azure Virtual Network
This article describes how to help protect MLOps solutions by using Azure network
security capabilities such as Azure Virtual Network, network peering, Azure Private Link,
and Azure DNS. It also introduces how to use:
Finally, this article describes the costs of using the network security services.
Architecture
Azure Blob Azure Machine Azure Container Azure Key
Storage Learning workspace Registry Vault User
Azure VPN
Gateway
Self-hosted
Compute Compute Azure Pipelines
agent
instance cluster
Virtual network
peering
BASTION VNET
10.2.0.0/16
Azure Kubernetes cluster
AML VNET
10.1.0.0/16
Resource group
Microsoft
Azure
Dataflow
The architecture diagram shows a sample MLOps solution.
The virtual network named AML VNET helps protect the Azure Machine Learning
workspace and its associated resources.
The jump host, Azure Bastion, and self-hosted agents belong to another virtual
network named BASTION VNET. This arrangement simulates having another
solution that requires access to the resources within AML VNET.
With the support of virtual network peering and private DNS zones, Azure
Pipelines can execute on self-host agents and trigger the Azure Machine Learning
pipelines that are published in the Azure Machine Learning workspace to train,
evaluate, and register the machine learning models.
Finally, the model is deployed to online endpoints or batch endpoints that are
supported by Azure Machine Learning compute or Azure Kubernetes Service
clusters.
Components
The sample MLOps solution consists of these components:
This example scenario also uses the following services to help protect the MLOps
solution:
Scenario details
MLOps is a set of practices at the intersection of Machine Learning, DevOps, and data
engineering that aims to deploy and maintain machine learning models in production
reliably and efficiently.
The following diagram shows a simplified MLOps process model. This model offers a
solution that automates data preparation, model training, model evaluation, model
registration, model deployment, and monitoring.
When you implement an MLOps solution, you might want to help secure these
resources:
DevOps pipelines
Machine learning training data
Machine learning pipelines
Machine learning models
Network security
Use Virtual Network to partially or fully isolate the environment from the public
internet to reduce the attack surface and the potential for data exfiltration.
In the Azure Machine Learning workspace, if you're still using Azure Machine
Learning CLI v1 and Azure Machine Learning Python SDK v1 (such as v1 API),
add a private endpoint to the workspace to provide network isolation for
everything except CRUD operations on the workspace or compute resources.
To take advantage of the new features of an Azure Machine Learning
workspace, use Azure Machine Learning CLI v2 and Azure Machine Learning
Python SDK v2 (such as v2 API), in which enabling a private endpoint on your
workspace doesn't provide the same level of network isolation. However, the
virtual network will still help protect the training data and machine learning
models. We recommend you evaluate v2 API before adopting it in your
enterprise solutions. See What is the new API platform on Azure Resource
Manager for more information.
Data encryption
Encrypt training data in transit and at rest by using platform-managed or
customer-managed access keys.
The Azure Machine Learning workspace is the top-level resource for Azure Machine
Learning and the core component of an MLOps solution. The workspace provides a
centralized place to work with all the artifacts that you create when you use Azure
Machine Learning.
When you create a new workspace, it automatically creates the following Azure
resources that are used by the workspace:
An engine manufacturer needs a more secure solution to help protect the data and
machine learning models of its factories and products for its system that uses
computer vision to detect defects in parts.
The MLOps solutions for these scenarios and others might use Azure Machine Learning
workspaces, Azure Blob Storage, Azure Kubernetes Service, Container Registry, and
other Azure services.
You can use all or part of this example for any similar scenario that has an MLOps
environment that's deployed on Azure and uses Azure security capabilities to help
protect the relevant resources. The original customer for this solution is in the
telecommunications industry.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that improve the quality of a workload when applied. For
more information, see Microsoft Azure Well-Architected Framework.
Security
Security provides more assurances against deliberate attacks and the abuse of your
valuable data and systems. For more information, see Overview of the security pillar.
Consider how to help secure your MLOps solution beginning with the architecture
design. Development environments might not need significant security, but it's
important in the staging and production environments.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
Configuring Virtual Network is free of charge, but there are charges for the other
services that your scenario might require, such as private links, DNS zones, and virtual
network peering. The following table describes the charges for those services and others
that might be required.
Private Link Pay only for private endpoint resource hours and the data that is
processed through your private endpoint.
Azure DNS, private Billing is based on the number of DNS zones that are hosted in
zone Azure and the number of DNS queries that are received.
Virtual Network Inbound and outbound traffic is charged at both ends of the peered
peering networks.
VPN gateway Charges are based on the amount of time that the gateway is
provisioned and available.
Azure Bastion Billing involves a combination of hourly pricing that is based on SKU,
scale units, and data transfer rates.
Operational excellence
Operational excellence covers the operations processes that deploy an application and
keep it running in production. For more information, see Overview of the operational
excellence pillar.
To streamline continuous integration and continuous delivery (CI/CD), the best practice
is to use tools and services for infrastructure as code (IaC), such as Terraform or Azure
Resource Manager templates, Azure DevOps, and Azure Pipelines.
Deploy this scenario
The following sections describe how to deploy, access, and help secure resources in this
example scenario.
Virtual Network
The first step in helping to secure the MLOps environment is to help protect the Azure
Machine Learning workspace and its associated resources. An effective method of
protection is to use Virtual Network. Virtual Network is the fundamental building block
for your private network in Azure. Virtual Network lets many types of Azure resources
more securely communicate with each other, the internet, and on-premises networks.
Putting the Azure Machine Learning workspace and its associated resources into a
virtual network helps ensure that components can communicate with each other
without exposing them to the public internet. Doing so reduces their attack surface and
helps to prevent data exfiltration.
The following Terraform snippet shows how to create a compute cluster for Azure
Machine Learning, attach it to a workspace, and put it into a subnet of a virtual network.
Terraform
In this example scenario, there are four private endpoints that are tied to Azure PaaS
options and are managed by a subnet in AML VNET, as shown in the architecture
diagram. Therefore, these services are only accessible to the resources within the same
virtual network, AML VNET. Those services are:
The following Terraform snippet shows how to use a private endpoint to link to an Azure
Machine Learning workspace, which is more protected by the virtual network as a result.
The snippet also shows use of a private DNS zone, which is described in Private DNS
zones.
Terraform
identity {
type = "SystemAssigned"
}
}
resource "azurerm_private_dns_zone_virtual_network_link"
"ws_zone_notebooks_link" {
name = "ws_zone_link_notebooks"
resource_group_name = "my_resource_group"
private_dns_zone_name = azurerm_private_dns_zone.ws_zone_notebooks.name
virtual_network_id = azurerm_virtual_network.aml_vnet.id
}
private_service_connection {
name = "my_aml_ws_psc"
private_connection_resource_id =
azurerm_machine_learning_workspace.aml_ws.id
subresource_names = ["amlworkspace"]
is_manual_connection = false
}
private_dns_zone_group {
name = "private-dns-zone-group-ws"
private_dns_zone_ids = [azurerm_private_dns_zone.ws_zone_api.id,
azurerm_private_dns_zone.ws_zone_notebooks.id]
}
This sample solution uses private endpoints for the Azure Machine Learning workspace
and for its associated resources such as Azure Storage, Azure Key Vault, or Container
Registry. Therefore, you must configure your DNS settings to resolve the IP addresses of
the private endpoints from the fully qualified domain name (FQDN) of the connection
string.
You can link a private DNS zone to a virtual network to resolve specific domains.
The Terraform snippet in Private Link and Azure Private Endpoint creates two private
DNS zones by using the zone names that are recommended in Azure services DNS zone
configuration:
privatelink.api.azureml.ms
privatelink.notebooks.azure.net
The following Terraform snippet sets up virtual network peering between AML VNET and
BASTION VNET.
Terraform
Use self-hosted agents in the same virtual network or the peering virtual network,
as shown in the architecture diagram.
Use Azure-hosted agents and add their IP address ranges to an allowlist in the
firewall settings of the targeted Azure services.
Each of these choices has pros and cons. The following table compares Azure-hosted
agents with self-hosted agents.
Cost Start free for one parallel Start free for one parallel job with
job with 1,800 minutes per unlimited minutes per month and a
month and a charge for charge for each extra self-hosted CI/CD
each Azure-hosted CI/CD parallel job with unlimited minutes.
parallel job. This option offers less-expensive
parallel jobs.
Maintenance Taken care of for you by Maintained by you with more control
Microsoft. over installing the software you like.
Build Time More time consuming Saves time because it keeps all your
because it completely files and caches.
refreshes every time you
start a build, and you
always build from scratch.
7 Note
Based on the comparisons in the table and the considerations of security and
complexity, this example scenario uses a self-hosted agent for Azure Pipelines to trigger
Azure Machine Learning pipelines in the virtual network.
Install the agent on a Docker container. This option isn't feasible, because this
scenario might require running the Docker container within the agent for machine
learning model training.
The following sample code provisions two self-hosted agents by creating Azure VMs
and extensions:
Terraform
settings = <<SETTINGS
{
"script":
"${base64encode(templatefile("../scripts/terraform/agent_init.sh", {
AGENT_USERNAME = "${var.AGENT_USERNAME}",
ADO_PAT = "${var.ADO_PAT}",
ADO_ORG_SERVICE_URL = "${var.ADO_ORG_SERVICE_URL}",
AGENT_POOL = "${var.AGENT_POOL}"
}))}"
}
SETTINGS
}
As shown in the preceding code block, the Terraform script calls agent_init.sh, shown in
the following code block, to install agent software and required libraries on the agent
VM per the customer's requirements.
Bash
#!/bin/sh
# Install other required libraries
...
# Unattended installation
sudo runuser -l ${AGENT_USERNAME} -c '/myagent/config.sh --unattended --url
${ADO_ORG_SERVICE_URL} --auth pat --token ${ADO_PAT} --pool ${AGENT_POOL}'
cd /myagent
#Configure as a service
sudo ./svc.sh install ${AGENT_USERNAME}
#Start service
sudo ./svc.sh start
In this example scenario, to ensure the self-hosted agent can access the container
registry in the virtual network, we use virtual network peering and add a virtual network
link to link the private DNS zone, privatelink.azurecr.io, to BASTION VNET. The following
Terraform snippet shows the implementation.
Terraform
private_service_connection {
name = "acr_psc"
private_connection_resource_id = azurerm_container_registry.acr.id
subresource_names = ["registry"]
is_manual_connection = false
}
private_dns_zone_group {
name = "private-dns-zone-group-app-acr"
private_dns_zone_ids = [azurerm_private_dns_zone.acr_zone.id]
}
}
This example scenario also ensures that the container registry has a contributor role for
the system-assigned managed identity of the Azure Machine Learning workspace.
Also note that for the compute cluster or instance, it's now possible to remove the
public IP address, which helps provide better protection for compute resources in the
MLOps solution. For more information, see No public IP for compute instances.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Other contributors:
Next steps
Terraform on Azure documentation
Azure Machine Learning Enterprise Terraform Example
Azure MLOps (v2) solution accelerator
Azure Virtual Network pricing
Pricing for Azure DevOps
Related resources
Machine learning operations (MLOps) framework to upscale machine learning
lifecycle with Azure Machine Learning
Secure an Azure Machine Learning workspace with virtual networks
Azure Pipelines agents
Machine Learning operations
maturity model
Azure Machine Learning
The purpose of this maturity model is to help clarify the Machine Learning Operations
(MLOps) principles and practices. The maturity model shows the continuous
improvement in the creation and operation of a production level machine learning
application environment. You can use it as a metric for establishing the progressive
requirements needed to measure the maturity of a machine learning production
environment and its associated processes.
Maturity model
The MLOps maturity model helps clarify the Development Operations (DevOps)
principles and practices necessary to run a successful MLOps environment. It's intended
to identify gaps in an existing organization's attempt to implement such an
environment. It's also a way to show you how to grow your MLOps capability in
increments rather than overwhelm you with the requirements of a fully mature
environment. Use it as a guide to:
As with most maturity models, the MLOps maturity model qualitatively assesses
people/culture, processes/structures, and objects/technology. As the maturity level
increases, the probability increases that incidents or errors will lead to improvements in
the quality of the development and production processes.
The tables that follow identify the detailed characteristics for that level of process
maturity. The model will continue to evolve. This version was last updated in January
2020.
Level 0: No MLOps
People Model Creation Model Release Application
Integration
Next steps
Learning path: Introduction to machine learning operations (MLOps)
Training module: Start the machine learning lifecycle with MLOps
MLOps: Model management, deployment, and monitoring with Azure Machine
Learning
What are Azure Machine Learning pipelines?
Related resources
Machine learning operations (MLOps) framework to upscale machine learning
lifecycle with Azure Machine Learning
Orchestrate MLOps by using Azure Databricks
Secure MLOps solutions with Azure network security
MLOps for Python models using Azure Machine Learning
Machine learning operations
(MLOps) framework to upscale
machine learning lifecycle with
Azure Machine Learning
Azure Data Factory Azure Machine Learning
This client project helped a Fortune 500 food company improve its demand forecasting. The
company ships products directly to multiple retail outlets. The improvement helped them
optimize the stocking of their products in different stores across several regions of the United
States. To achieve this, Microsoft's Commercial Software Engineering (CSE) team worked with
the client's data scientists on a pilot study to develop customized machine learning models for
the selected regions. The models take into account:
Shopper demographics
Historical and forecasted weather
Past shipments
Product returns
Special events
The goal to optimize stocking represented a major component of the project and the client
realized a significant sales lift in the early field trials. Also, the team saw a 40% reduction in
forecasting mean absolute percentage error (MAPE) when compared with a historical average
baseline model.
A key part of the project was figuring out how to scale up the data science workflow from the
pilot study to a production level. This production-level workflow required the CSE team to:
The typical data science workflow today is closer to a one-off lab environment than a
production workflow. An environment for data scientists must be suitable for them to:
Most tools that are used for these tasks have specific purposes and aren't well suited to
automation. In a production level machine learning operation, there must be more
consideration given to application lifecycle management and DevOps.
The CSE team helped the client scale up the operation to production levels. They implemented
various aspects of continuous integration (CI)/continuous delivery (CD) capabilities and
addressed issues like observability, and integration with Azure capabilities. During the
implementation, the team uncovered gaps in existing MLOps guidance. Those gaps needed to
be filled so that MLOps was better understood and applied at scale.
Understanding MLOps practices helps organizations ensure that the machine learning models
that the system produces are production quality models that improve business performance.
When MLOps is implemented, the organization no longer has to spend as much of their time
on low-level details relating to the infrastructure and engineering work that's required to
develop and run machine learning models for production level operations. Implementing
MLOps also helps the data science and software engineering communities to learn to work
together to deliver a production-ready system.
The CSE team used this project to address machine learning community needs by addressing
issues like developing an MLOps maturity model. These efforts were aimed at improving
MLOps adoption by understanding the typical challenges of the key players in the MLOps
process.
Engagement scenario
The client delivers products directly to retail market outlets on a regular schedule. Each retail
outlet varies in its product usage patterns, so product inventory needs to vary in each weekly
delivery. Maximizing sales and minimizing product returns and lost sales opportunities are the
goals of the demand forecasting methodologies that the client uses. This project focused on
using machine learning to improve the forecasts.
The CSE team divided the project into two phases. Phase 1 focused on developing machine
learning models to support a field-based pilot study on the effectiveness of machine learning
forecasting for a selected sales region. The success of Phase 1 led to Phase 2, in which the
team scaled up the initial pilot study from a minimal group of models that supported a single
geographic region to a set of sustainable production-level models for all of the client's sales
regions. A primary consideration for the scaled up solution was the need to accommodate the
large number of geographic regions and their local retail outlets. The team dedicated the
machine learning models to both large and small retail outlets in each region.
The Phase 1 pilot study determined that a model dedicated to one region's retail outlets could
use local sales history, local demographics, weather, and special events to optimize the
demand forecast for the outlets in the region. Four ensemble machine learning forecasting
models served market outlets in a single region. The models processed data in weekly batches.
Also, the team developed two baseline models using historical data for comparison.
For the first version of the scaled up Phase 2 solution, the CSE team selected 14 geographic
regions to participate, including small and large market outlets. They used more than 50
machine learning forecasting models. The team expected further system growth and continued
refinement of the machine learning models. It quickly became clear that this wider-scaled
machine learning solution is sustainable only if it's based on the best practice principles of
DevOps for the machine learning environment.
categorical
embeddings
Same as
above for an
additional 13
geographic
regions
Environment Market Format Models Model Model
Region Subdivision Description
Same as
above for the
prod
environment
The MLOps process provided a framework for the scaled up system that addressed the full
lifecycle of the machine learning models. The framework includes development, testing,
deployment, operation, and monitoring. It fulfills the needs of a classic CI/CD process.
However, because of its relative immaturity compared to DevOps, it became evident that
existing MLOps guidance had gaps. The project team worked to fill in some of those gaps.
They wanted to provide a functional process model that insures the viability of the scaled up
machine learning solution.
The MLOps process that was developed from this project made a significant real-world step to
move MLOps to a higher level of maturity and viability. The new process is directly applicable
to other machine learning projects. The CSE team used what they learned to build a draft of an
MLOps maturity model that anyone can apply to other machine learning projects.
Technical scenario
MLOps, also known as DevOps for machine learning, is an umbrella term that encompasses
philosophies, practices, and technologies that are related to implementing machine learning
lifecycles in a production environment. It's still a relatively new concept. There have been many
attempts to define what MLOps is and many people have questioned whether MLOps can
subsume everything from how data scientists prepare data to how they ultimately deliver,
monitor, and evaluate machine learning results. While DevOps has had years to develop a set
of fundamental practices, MLOps is still early in its development. As it evolves, we discover the
challenges of bringing together two disciplines that often operate with different skill sets and
priorities: software/ops engineering, and data science.
Implementing MLOps in real-world production environments has unique challenges that must
be overcome. Teams can use Azure to support MLOps patterns. Azure can also provide clients
with asset management and orchestration services for effectively managing the machine
learning lifecycle. Azure services are the foundation for the MLOps solution that we describe in
this article.
Initial experimental models that were developed in Jupyter notebooks and implemented
in Python.
7 Note
Teams used the same machine learning approach for large and small stores, but the
training and scoring data depended on the size of the store.
Model retraining whenever code or data changes, or the model goes stale.
Model performance in scoring that's considered significant when MAPE <= 45% when
compared with a historical average baseline model.
MLOps requirements
The team had to meet several key requirements to scale up the solution from the Phase 1 pilot
field study, in which only a few models were developed for a single sales region. Phase 2
implemented custom machine learning models for multiple regions. The implementation
included:
Weekly batch processing for large and small stores in each region to retrain the models
with new datasets.
7 Note
This represents a shift in how data scientists and data engineers have commonly
worked in the past.
A unique model that represented each region for large and small stores based on store
history, demographics, and other key variables. The model had to process the entire
dataset to minimize the risk of processing error.
The ability to initially scale up to support 14 sales regions with plans to scale up further.
Plans for additional models for longer term forecasting for regions and other store
clusters.
Deploy Model here can represent any operational use of the validated machine learning model.
Compared to DevOps, MLOps presents the additional challenge of integrating the machine
learning lifecycle into the typical CI/CD process.
The data science lifecycle doesn't follow the typical software development lifecycle. It includes
the use of Azure Machine Learning to train and score the models, so these steps had to be
included in the CI/CD automation.
Batch processing of data is the basis of the architecture. Two Azure Machine Learning pipelines
are central to the process, one for training and the other for scoring. This diagram shows the
data science methodology that was used for the initial phase of the client project:
The team tested several algorithms. They ultimately chose an ensemble design of a LASSO
linear regression model and a neural network with categorical embeddings. The team used the
same model, defined by the level of product that the client could store on site, for both large
and small stores. The team further subdivided the model into fast-moving and slow-moving
products.
The data scientists train the machine learning models when the team releases new code and
when new data is available. Training typically happens weekly. Consequently, each processing
run involves a large amount of data. Because the team collects the data from many sources in
different formats, it requires conditioning to put the data into a consumable format before the
data scientists can process it. The data conditioning requires significant manual effort and the
CSE team identified it as a primary candidate for automation.
As mentioned, the data scientists developed and applied the experimental Azure Machine
Learning models to a single sales region in the Phase 1 pilot field study to evaluate the
usefulness of this forecasting approach. The CSE team judged that the sales lift for the stores in
the pilot study was significant. This success justified applying the solution to full production
levels in Phase 2, starting with 14 geographic regions and thousands of stores. The team could
then use the same pattern to add additional regions.
The pilot model served as the basis for the scaled up solution, but the CSE team knew that the
model needed further refinement on a continuing basis to improve its performance.
MLOps solution
As MLOps concepts mature, teams often discover challenges in bringing the data science and
DevOps disciplines together. The reason is that the principal players in the disciplines, software
engineers and data scientists, operate with different skill sets and priorities.
But there are similarities to build on. MLOps, like DevOps, is a development process
implemented by a toolchain. The MLOps toolchain includes such things as:
Version control
Code analysis
Build automation
Continuous integration
Testing frameworks and automation
Compliance policies integrated into CI/CD pipelines
Deployment automation
Monitoring
Disaster recovery and high availability
Package and container management
As noted above, the solution takes advantage of existing DevOps guidance, but is augmented
to create a more mature MLOps implementation that meets the needs of the client and of the
data science community. MLOps builds on DevOps guidance with these additional
requirements:
Data and model versioning isn't the same as code versioning: There must be versioning
of datasets as the schema and origin data changes.
Digital audit trail requirements: Track all changes when dealing with code and client
data.
Generalization: Models are different than code for reuse, since data scientists must tune
models based on input data and scenario. To reuse a model for a new scenario, you may
need to fine-tune/transfer/learn on it. You need the training pipeline.
Stale models: Models tend to decay over time and you need the ability to retrain them
on demand to ensure they remain relevant in production.
MLOps challenges
Traceable lineage of both code and data is also needed to diagnose model issues and create
reproducible models. Custom dashboards can make sense of how deployed models are
performing and indicate when to intervene. The team created such dashboards for this project.
Much of the pilot field test focused on conditioning the raw data so that the machine learning
model could process it. In an MLOps system, the team should automate this process, and track
the outputs.
Level Description
0 No Ops
2 Automated training
For the current version of the MLOps maturity model, see the MLOps maturity model article.
Data conditioning
Model training
Model testing and evaluation
Build definition and pipeline
Release pipeline
Deployment
Scoring
The Experiment phase is unique to the data science lifecycle, which reflects how data scientists
traditionally do their work. It differs from how code developers do their work. The following
diagram illustrates this lifecycle in more detail.
Integrating this data development process into MLOps poses a challenge. Here you see the
pattern that the team used to integrate the process into a form that MLOps can support:
The role of MLOps is to create a coordinated process that can efficiently support the large-
scale CI/CD environments that are common in production level systems. Conceptually, the
MLOps model must include all process requirements from experimentation to scoring.
The CSE team refined the MLOps process to fit the client's specific needs. The most notable
need was batch processing instead of real-time processing. As the team developed the scaled
up system, they identified and resolved some shortcomings. The most significant of these
shortcomings led to the development of a bridge between Azure Data Factory and Azure
Machine Learning, which the team implemented by using a built-in connector in Azure Data
Factory. They created this component set to facilitate the triggering and status monitoring
necessary to make the process automation work.
Another fundamental change was that the data scientists needed the capability to export
experimental code from Jupyter notebooks into the MLOps deployment process rather than
trigger training and scoring directly.
) Important
Scoring is the final step. The process runs the machine learning model to make
predictions. This addresses the basic business use case requirement for demand
forecasting. The team rates the quality of the predictions using the MAPE, which is a
measure of prediction accuracy of statistical forecasting methods and a loss function for
regression problems in machine learning. In this project, the team considered a MAPE of
<= 45% significant.
You can consider the MLOps process data flow shown above as an archetype framework for
projects that make similar architectural choices.
Running this full set of steps on the machine learning environment is expensive and time
consuming. As a result, the team did basic model validation tests locally on a development
machine. It ran the steps above and used the following:
Local testing dataset: A small dataset, often one that's obfuscated, that's checked in to
the repository and consumed as the input data source.
Local flag: A flag or argument in the model's code that indicates that the code intends
the dataset to run locally. The flag tells the code to bypass any call to the machine
learning environment.
This goal of these validation tests isn't to evaluate the performance of the trained model.
Rather, it's to validate that the code for the end-to-end process is of good quality. It assures
the quality of the code that's pushed upstream, like the incorporation of model validation tests
in the PR and CI build. It also makes it possible for engineers and data scientists to put
breakpoints into the code for debugging purposes.
Scoring CD pipeline
The scoring CD pipeline is applicable for the batch inference scenario, where the same model
orchestrator that's used for model validation triggers the published scoring pipeline.
Data scientist: Creates the machine learning model and its algorithms.
Engineer
Data engineer: Handles data conditioning.
Software engineer: Handles model integration into the asset package and the CI/CD
workflow.
Operations or IT: Oversees system operations.
Business stakeholder: Concerned with the predictions made by the machine learning
model and how they help the business.
Data end user: Consumes model output in some way that aids in making business
decisions.
The team had to address three key findings from the persona and role studies:
Data scientists and engineers have a mismatch of approach and skills in their work.
Making it easy for the data scientist and the engineer to collaborate is a major
consideration for the design of the MLOps process flow. It requires new skill acquisitions
by all team members.
There's a need to unify all of the principal personas without alienating anyone. A way to
do this is to:
Make sure they understand the conceptual model for MLOps.
Agree on the team members that will work together.
Establish working guidelines to achieve common goals.
If the business stakeholder and data end user need a way to interact with the data output
from the models, a user-friendly UI is the standard solution.
Other teams will certainly come across similar issues in other machine learning projects as they
scale up for production use.
Logical architecture
The data comes from many sources in many different formats, so it's conditioned before it's
inserted into the data lake. The conditioning is done by using microservices operating as Azure
Functions. The clients customize the microservices to fit the data sources and transform them
into a standardized csv format that the training and scoring pipelines consume.
System architecture
Solution overview
Azure Data Factory does the following:
Triggers an Azure Function to start data ingestion and a run of the Azure Machine
Learning pipeline.
Launches a durable function to poll the Azure Machine Learning pipeline for completion.
Custom dashboards in Power BI display the results. Other Azure dashboards that are
connected to SQL Azure, Azure Monitor, and App Insights via OpenCensus Python SDK, track
Azure resources. These dashboards provide information about the health of the machine
learning system. They also yield data that the client uses for product order forecasting.
Model orchestration
Model orchestration follows these steps:
1. When a PR is submitted, DevOps triggers a code validation pipeline.
2. The pipeline runs unit tests, code quality tests, and model validation tests.
3. When merged into the main branch, the same code validation tests are run, and DevOps
packages the artifacts.
4. DevOps collecting of artifacts triggers Azure Machine Learning to do:
a. Data validation.
b. Training validation.
c. Scoring validation.
5. After validation completes, the final scoring pipeline runs.
6. Changing data and submitting a new PR triggers the validation pipeline again, followed
by the final scoring pipeline.
Enable experimentation
As mentioned, the traditional data science machine learning lifecycle doesn't support the
MLOps process without modification. It uses different kinds of manual tools and
experimentation, validation, packaging, and model handoff that can't be easily scaled for an
effective CI/CD process. MLOps demands a high level of process automation. Whether a new
machine learning model is being developed or an old one is modified, it's necessary to
automate the lifecycle of the machine learning model. In the Phase 2 project, the team used
Azure DevOps to orchestrate and republish Azure Machine Learning pipelines for training
tasks. The long-running main branch performs basic testing of models, and pushes stable
releases through the long-running release branch.
Source control becomes an important part of this process. Git is the version control system
that's used to track notebook and model code. It also supports process automation. The basic
workflow that's implemented for source control applies the following principles:
The team explored, but did not implement, an option to carry the process forward to the point
where it would support having many real-time models running in production to service a given
request. This option can accommodate the use of ensemble models in A/B testing and
interleaved experiments.
End-user interfaces
The team developed end-user UIs for observability, monitoring, and instrumentation. As
mentioned, dashboards visually display the machine learning model data. These dashboards
show the following data in a user-friendly format:
7 Note
Stale models are scoring runs where the data scientists trained the model used for scoring
more than 60 days from when scoring took place. the Scoring page of the ML Monitor
dashboard displays this health metric.
Components
Azure Machine Learning
Azure Machine Learning Compute
Azure Machine Learning Pipelines
Azure Machine Learning Model Registry
Azure Blob Storage
Azure Data Lake Storage
Azure Pipelines
Azure Data Factory
Azure Functions for Python
Azure Monitor
Logs
Application Insights
Azure SQL Database
Azure Dashboards
Power BI
Considerations
Here you'll find a list of considerations to explore. They're based on the lessons the CSE team
learned during the project.
Environment considerations
Data scientists develop most of their machine learning models by using Python, often
starting with Jupyter notebooks. It can be a challenge to implement these notebooks as
production code. Jupyter notebooks are more of an experimental tool, while Python
scripts are more appropriate for production. Teams often need to spend time refactoring
model creation code into Python scripts.
Make clients who are new to DevOps and machine learning aware that experimentation
and production require different rigor, so it's good practice to separate the two.
Tools like the Azure Machine Learning Visual Designer or AutoML can be effective in
getting basic models off the ground while the client ramps up on standard DevOps
practices to apply to the rest of the solution.
Azure DevOps has plug-ins that can integrate with Azure Machine Learning to help
trigger pipeline steps. The MLOpsPython repo has a few examples of such pipelines.
Machine learning often requires powerful GPU machines for training. If the client doesn't
already have such hardware available, Azure Machine Learning compute clusters can
provide an effective path for quickly provisioning cost-effective powerful hardware that
autoscales. If a client has advanced security or monitoring needs, there are other options
such as standard VMs, Databricks, or local compute.
For a client to be successful, their model building teams (data scientists) and deployment
teams (DevOps engineers) need to have a strong communication channel. They can
accomplish this with daily stand-up meetings or a formal online chat service. Both
approaches help in integrating their development efforts in an MLOps framework.
Going from a notebook experiment to repeatable scripts is a rough transition for many
data scientists. The sooner you can get them writing their training code in Python scripts
the easier it will be for them to begin versioning their training code and enabling
retraining.
That isn't the only possible method. Databricks supports scheduling notebooks as jobs.
But, based on current client experience, this approach is difficult to instrument with full
DevOps practices because of testing limitations.
It's also important to understand what metrics are being used to consider a model a
success. Accuracy alone is often not good enough to determine the overall performance
of one model versus another.
Compute considerations
Customers should consider using containers to standardize their compute environments.
Nearly all Azure Machine Learning compute targets support using Docker . Having a
container handle the dependencies can reduce friction significantly, especially if the team
uses many compute targets.
Next steps
Learn more about MLOps
MLOps on Azure
Azure Monitor Visualizations
Machine Learning Lifecycle
Azure DevOps Machine Learning extension
Azure Machine Learning CLI
Trigger applications, processes, or CI/CD workflows based on Azure Machine Learning
events
Set up model training and deployment with Azure DevOps
Set up MLOps with Azure Machine Learning and Databricks
Related resources
MLOps maturity model
Orchestrate MLOps on Azure Databricks using Databricks Notebook
MLOps for Python models using Azure Machine Learning
Data science and machine learning with Azure Databricks
Citizen AI with the Power Platform
Deploy AI and machine learning computing on-premises and to the edge
What is the Team Data Science
Process?
Azure Machine Learning
The Team Data Science Process (TDSP) is an agile, iterative data science methodology to
deliver predictive analytics solutions and intelligent applications efficiently. TDSP helps
improve team collaboration and learning by suggesting how team roles work best
together. TDSP includes best practices and structures from Microsoft and other industry
leaders to help toward successful implementation of data science initiatives. The goal is
to help companies fully realize the benefits of their analytics program.
This article provides an overview of TDSP and its main components. We provide a
generic description of the process here that can be implemented with different kinds of
tools. A more detailed description of the project tasks and roles involved in the lifecycle
of the process is provided in additional linked topics. Guidance on how to implement
the TDSP using a specific set of Microsoft tools and infrastructure that we use to
implement the TDSP in our teams is also provided.
If you are using another data science lifecycle, such as CRISP-DM , KDD, or your
organization's own custom process, you can still use the task-based TDSP in the context
of those development lifecycles. At a high level, these different methodologies have
much in common.
This lifecycle has been designed for data science projects that ship as part of intelligent
applications. These applications deploy machine learning or artificial intelligence models
for predictive analytics. Exploratory data science projects or improvised analytics
projects can also benefit from using this process. But in such cases some of the steps
described may not be needed.
The lifecycle outlines the major stages that projects typically execute, often iteratively:
Business Understanding
Data Acquisition and Understanding
Modeling
Deployment
The goals, tasks, and documentation artifacts for each stage of the lifecycle in TDSP are
described in the Team Data Science Process lifecycle topic. These tasks and artifacts are
associated with project roles:
Solution architect
Project manager
Data engineer
Data scientist
Application developer
Project lead
The following diagram provides a grid view of the tasks (in blue) and artifacts (in green)
associated with each stage of the lifecycle (on the horizontal axis) for these roles (on the
vertical axis).
We provide templates for the folder structure and required documents in standard
locations. This folder structure organizes the files that contain code for data exploration
and feature extraction, and that record model iterations. These templates make it easier
for team members to understand work done by others and to add new members to
teams. It is easy to view and update document templates in markdown format. Use
templates to provide checklists with key questions for each project to insure that the
problem is well defined and that deliverables meet the quality expected. Examples
include:
a project charter to document the business problem and scope of the project
data reports to document the structure and statistics of the raw data
model reports to document the derived features
model performance metrics such as ROC curves or MSE
The analytics and storage infrastructure, where raw and processed datasets are stored,
may be in the cloud or on-premises. This infrastructure enables reproducible analysis. It
also avoids duplication, which may lead to inconsistencies and unnecessary
infrastructure costs. Tools are provided to provision the shared resources, track them,
and allow each team member to connect to those resources securely. It is also a good
practice to have project members create a consistent compute environment. Different
team members can then replicate and validate experiments.
Here is an example of a team working on multiple projects and sharing various cloud
analytics infrastructure components.
Next steps
Team Data Science Process: Roles and tasks Outlines the key personnel roles and their
associated tasks for a data science team that standardizes on this process.
The Team Data Science Process lifecycle
Article • 11/15/2022
The Team Data Science Process (TDSP) provides a recommended lifecycle that you can
use to structure your data-science projects. The lifecycle outlines the complete steps
that successful projects follow. If you use another data-science lifecycle, such as the
Cross Industry Standard Process for Data Mining (CRISP-DM) , Knowledge Discovery in
Databases (KDD) , or your organization's own custom process, you can still use the
task-based TDSP.
This lifecycle is designed for data-science projects that are intended to ship as part of
intelligent applications. These applications deploy machine learning or artificial
intelligence models for predictive analytics. Exploratory data-science projects and
improvised analytics projects can also benefit from the use of this process. But for those
projects, some of the steps described here might not be needed.
1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
Next steps
For examples of how to execute steps in TDSPs that use Azure Machine Learning, see
Use the TDSP with Azure Machine Learning.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Related resources
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Machine learning at scale
The business understanding stage of the
Team Data Science Process lifecycle
Article • 11/15/2022
This article outlines the goals, tasks, and deliverables associated with the business
understanding stage of the Team Data Science Process (TDSP). This process provides a
recommended lifecycle that you can use to structure your data-science projects. The
lifecycle outlines the major stages that projects typically execute, often iteratively:
1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
Goals
Specify the key variables that are to serve as the model targets and whose related
metrics are used determine the success of the project.
Identify the relevant data sources that the business has access to or needs to
obtain.
How to do it
There are two main tasks addressed in this stage:
Define objectives: Work with your customer and other stakeholders to understand
and identify the business problems. Formulate questions that define the business
goals that the data science techniques can target.
Identify data sources: Find the relevant data that helps you answer the questions
that define the objectives of the project.
Define objectives
1. A central objective of this step is to identify the key business variables that the
analysis needs to predict. We refer to these variables as the model targets, and we
use the metrics associated with them to determine the success of the project. Two
examples of such targets are sales forecasts or the probability of an order being
fraudulent.
2. Define the project goals by asking and refining "sharp" questions that are relevant,
specific, and unambiguous. Data science is a process that uses names and numbers
to answer such questions. You typically use data science or machine learning to
answer five types of questions:
Determine which of these questions you're asking and how answering it achieves
your business goals.
3. Define the project team by specifying the roles and responsibilities of its members.
Develop a high-level milestone plan that you iterate on as you discover more
information.
4. Define the success metrics. For example, you might want to achieve a customer
churn prediction. You need an accuracy rate of "x" percent by the end of this three-
month project. With this data, you can offer customer promotions to reduce churn.
The metrics must be SMART:
Specific
Measurable
Achievable
Relevant
Time-bound
Data that's relevant to the question. Do you have measures of the target and
features that are related to the target?
Data that's an accurate measure of your model target and the features of interest.
For example, you might find that the existing systems need to collect and log additional
kinds of data to address the problem and achieve the project goals. In this situation, you
might want to look for external data sources or update your systems to collect new data.
Artifacts
Here are the deliverables in this stage:
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Here are links to each step in the lifecycle of the TDSP:
1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
Data acquisition and understanding
stage of the Team Data Science Process
Article • 11/15/2022
This article outlines the goals, tasks, and deliverables associated with the data
acquisition and understanding stage of the Team Data Science Process (TDSP). This
process provides a recommended lifecycle that you can use to structure your data-
science projects. The lifecycle outlines the major stages that projects typically execute,
often iteratively:
1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
Goals
Produce a clean, high-quality data set whose relationship to the target variables is
understood. Locate the data set in the appropriate analytics environment so you
are ready to model.
Develop a solution architecture of the data pipeline that refreshes and scores the
data regularly.
How to do it
There are three main tasks addressed in this stage:
After you're satisfied with the quality of the cleansed data, the next step is to better
understand the patterns that are inherent in the data. This data analysis helps you
choose and develop an appropriate predictive model for your target. Look for evidence
for how well connected the data is to the target. Then determine whether there is
sufficient data to move forward with the next modeling steps. Again, this process is
often iterative. You might need to find new data sources with more accurate or more
relevant data to augment the data set initially identified in the previous stage.
In this stage, you develop a solution architecture of the data pipeline. You develop the
pipeline in parallel with the next stage of the data science project. Depending on your
business needs and the constraints of your existing systems into which this solution is
being integrated, the pipeline can be one of the following options:
Batch-based
Streaming or real time
A hybrid
Artifacts
The following are the deliverables in this stage:
Data quality report : This report includes data summaries, the relationships
between each attribute and target, variable ranking, and more.
Solution architecture: The solution architecture can be a diagram or description of
your data pipeline that you use to run scoring or predictions on new data after you
have built a model. It also contains the pipeline to retrain your model based on
new data. Store the document in the Project directory when you use the TDSP
directory structure template.
Checkpoint decision: Before you begin full-feature engineering and model
building, you can reevaluate the project to determine whether the value expected
is sufficient to continue pursuing it. You might, for example, be ready to proceed,
need to collect more data, or abandon the project as the data does not exist to
answer the question.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
Modeling stage of the Team Data
Science Process lifecycle
Article • 11/15/2022
This article outlines the goals, tasks, and deliverables associated with the modeling stage
of the Team Data Science Process (TDSP). This process provides a recommended
lifecycle that you can use to structure your data-science projects. The lifecycle outlines
the major stages that projects typically execute, often iteratively:
1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
Goals
Determine the optimal data features for the machine-learning model.
Create an informative machine-learning model that predicts the target most
accurately.
Create a machine-learning model that's suitable for production.
How to do it
There are three main tasks addressed in this stage:
Feature engineering: Create data features from the raw data to facilitate model
training.
Model training: Find the model that answers the question most accurately by
comparing their success metrics.
Determine if your model is suitable for production.
Feature engineering
Feature engineering involves the inclusion, aggregation, and transformation of raw
variables to create the features used in the analysis. If you want insight into what is
driving a model, then you need to understand how the features relate to each other and
how the machine-learning algorithms are to use those features.
This step requires a creative combination of domain expertise and the insights obtained
from the data exploration step. Feature engineering is a balancing act of finding and
including informative variables, but at the same time trying to avoid too many unrelated
variables. Informative variables improve your result; unrelated variables introduce
unnecessary noise into the model. You also need to generate these features for any new
data obtained during scoring. As a result, the generation of these features can only
depend on data that's available at the time of scoring.
Model training
Depending on the type of question that you're trying to answer, there are many
modeling algorithms available. For guidance on choosing a prebuilt algorithm with
designer, see Machine Learning Algorithm Cheat Sheet for Azure Machine Learning
designer; other algorithms are available through open-source packages in R or Python.
Although this article focuses on Azure Machine Learning, the guidance it provides is
useful for any machine-learning projects.
See Train models with Azure Machine Learning for options on training models in Azure
Machine Learning.
7 Note
Avoid leakage: You can cause data leakage if you include data from outside the
training data set that allows a model or machine-learning algorithm to make
unrealistically good predictions. Leakage is a common reason why data scientists
get nervous when they get predictive results that seem too good to be true. These
dependencies can be hard to detect. To avoid leakage often requires iterating
between building an analysis data set, creating a model, and evaluating the
accuracy of the results.
Model Evaluation
After training, the data scientist focuses next on model evaluation.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Here are links to each step in the lifecycle of the TDSP:
1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
Deployment stage of the Team Data
Science Process lifecycle
Article • 11/15/2022
This article outlines the goals, tasks, and deliverables associated with the deployment of
the Team Data Science Process (TDSP). This process provides a recommended lifecycle
that you can use to structure your data-science projects. The lifecycle outlines the major
stages that projects typically execute, often iteratively:
1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
Goal
Deploy models with a data pipeline to a production or production-like environment for
final user acceptance.
How to do it
The main task addressed in this stage:
Operationalize a model
After you have a set of models that perform well, you can operationalize them for other
applications to consume. Depending on the business requirements, predictions are
made either in real time or on a batch basis. To deploy models, you expose them with an
open API interface. The interface enables the model to be easily consumed from various
applications, such as:
Online websites
Spreadsheets
Dashboards
Line-of-business applications
Back-end applications
For examples of model operationalization with Azure Machine Learning, see Deploy
machine learning models to Azure. It is a best practice to build telemetry and
monitoring into the production model and the data pipeline that you deploy. This
practice helps with subsequent system status reporting and troubleshooting.
Artifacts
A status dashboard that displays the system health and key metrics
A final modeling report with deployment details
A final solution architecture document
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Mark Tabladillo | Senior Cloud Solution Architect
Next steps
Here are links to each step in the lifecycle of the TDSP:
1. Business understanding
2. Data Acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
For Azure, we recommend applying TDSP using Azure Machine Learning: for an
overview of Azure Machine Learning see What is Azure Machine Learning?.
Customer acceptance stage of the Team
Data Science Process lifecycle
Article • 11/15/2022
This article outlines the goals, tasks, and deliverables associated with the customer
acceptance stage of the Team Data Science Process (TDSP). This process provides a
recommended lifecycle that you can use to structure your data-science projects. The
lifecycle outlines the major stages that projects typically execute, often iteratively:
1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
Goal
Finalize project deliverables: Confirm that the pipeline, the model, and their
deployment in a production environment satisfy the customer's objectives.
How to do it
There are two main tasks addressed in this stage:
System validation: Confirm that the deployed model and pipeline meet the
customer's needs.
Project hand-off: Hand the project off to the entity that's going to run the system
in production.
The customer should validate that the system meets their business needs and that it
answers the questions with acceptable accuracy to deploy the system to production for
use by their client's application. All the documentation is finalized and reviewed. The
project is handed-off to the entity responsible for operations. This entity might be, for
example, an IT or customer data-science team or an agent of the customer that's
responsible for running the system in production.
Artifacts
The main artifact produced in this final stage is the Exit report of the project for the
customer. This technical report contains all the details of the project that are useful for
learning about how to operate the system. TDSP provides an Exit report template. You
can use the template as is, or you can customize it for specific client needs.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Here are links to each step in the lifecycle of the TDSP:
1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
For Azure, we recommend applying TDSP using Azure Machine Learning: for an
overview of Azure Machine Learning see What is Azure Machine Learning?.
Team Data Science Process roles and
tasks
Article • 12/06/2022
The Team Data Science Process (TDSP) is a framework developed by Microsoft that
provides a structured methodology to efficiently build predictive analytics solutions and
intelligent applications. This article outlines the key personnel roles and associated tasks
for a data science team standardizing on this process.
This introductory article links to tutorials on how to set up the TDSP environment. The
tutorials provide detailed guidance for using Azure DevOps Projects, Azure Repos
repositories, and Azure Boards. The motivating goal is moving from concept through
modeling and into deployment.
The tutorials use Azure DevOps because that is how to implement TDSP at Microsoft.
Azure DevOps facilitates collaboration by integrating role-based security, work item
management and tracking, and code hosting, sharing, and source control. The tutorials
also use an Azure Data Science Virtual Machine (DSVM) as the analytics desktop,
which has several popular data science tools pre-configured and integrated with
Microsoft software and Azure services.
You can use the tutorials to implement TDSP using other code-hosting, agile planning,
and development tools and environments, but some features may not be available.
In such a structure, there are group leads and team leads. Typically, a data science
project is done by a data science team. Data science teams have project leads for
project management and governance tasks, and individual data scientists and engineers
to perform the data science and data engineering parts of the project. The initial project
setup and governance is done by the group, team, or project leads.
1. Group Manager: Manages the entire data science unit in an enterprise. A data
science unit might have multiple teams, each of which is working on multiple data
science projects in distinct business verticals. A Group Manager might delegate
their tasks to a surrogate, but the tasks associated with the role do not change.
2. Team Lead: Manages a team in the data science unit of an enterprise. A team
consists of multiple data scientists. For a small data science unit, the Group
Manager and the Team Lead might be the same person.
3. Project Lead: Manages the daily activities of individual data scientists on a specific
data science project.
7 Note
Depending on the structure and size of an enterprise, a single person may play
more than one role, or more than one person may fill a role.
For detailed instructions, see Group Manager tasks for a data science team.
Team Lead tasks
The Team Lead or a designated project administrator completes the following tasks to
adopt the TDSP:
For detailed instructions, see Team Lead tasks for a data science team.
Creates a project repository in the team project, and seeds it from the project
template repository.
Optionally creates Azure file storage to store the project's data assets.
Optionally mounts the Azure file storage to the DSVM and adds project data
assets to it.
Sets up security control by adding project members and configuring their
permissions.
For detailed instructions, see Project Lead tasks for a data science team.
For detailed instructions for onboarding onto a project, see Project Individual
Contributor tasks for a data science team.
The following figure outlines the TDSP workflow for project execution:
For detailed instructions on project execution workflow, see Agile development of data
science projects.
Principal author:
Next steps
Explore more detailed descriptions of the roles and tasks defined by the Team Data
Science Process:
Related resources
Team Data Science Process group manager tasks
Tasks for the team lead on a Team Data Science Process team
Project lead tasks in the Team Data Science Process
Tasks for an individual contributor in the Team Data Science Process
Team Data Science Process group
manager tasks
Article • 11/15/2022
This article describes the tasks that a group manager completes for a data science
organization. The group manager manages the entire data science unit in an enterprise.
A data science unit may have several teams, each of which is working on many data
science projects in distinct business verticals. The group manager's objective is to
establish a collaborative group environment that standardizes on the Team Data Science
Process (TDSP). For an outline of all the personnel roles and associated tasks handled by
a data science team standardizing on the TDSP, see Team Data Science Process roles
and tasks.
The following diagram shows the six main group manager setup tasks. Group managers
may delegate their tasks to surrogates, but the tasks associated with the role don't
change.
7 Note
This article uses Azure DevOps to set up a TDSP group environment, because that
is how to implement TDSP at Microsoft. If your group uses other code hosting or
development platforms, the Group Manager's tasks are the same, but the way to
complete them may be different.
Create an organization and project in Azure
DevOps
1. Go to visualstudio.microsoft.com , select Sign in at upper right, and sign into
your Microsoft account.
If you don't have a Microsoft account, select Sign up now, create a Microsoft
account, and sign in using this account. If your organization has a Visual Studio
subscription, sign in with the credentials for that subscription.
2. After you sign in, at upper right on the Azure DevOps page, select Create new
organization.
3. If you're prompted to agree to the Terms of Service, Privacy Statement, and Code
of Conduct, select Continue.
4. In the signup dialog, name your Azure DevOps organization and accept the host
region assignment, or drop down and select a different region. Then select
Continue.
5. Under Create a project to get started, enter GroupCommon, and then select
Create project.
1. On the GroupCommon project Summary page, select Repos. This action takes you
to the default GroupCommon repository of the GroupCommon project, which is
currently empty.
2. At the top of the page, drop down the arrow next to GroupCommon and select
Manage repositories.
3. On the Project Settings page, select the ... next to GroupCommon, and then select
Rename repository.
4. In the Rename the GroupCommon repository popup, enter GroupProjectTemplate,
and then select Rename.
2. At the top of the page, drop down the arrow next to GroupProjectTemplate and
select New repository.
3. In the Create a new repository dialog, select Git as the Type, enter GroupUtilities
as the Repository name, and then select Create.
4. On the Project Settings page, select Repositories under Repos in the left
navigation to see the two group repositories: GroupProjectTemplate and
GroupUtilities.
Import the Microsoft TDSP team repositories
In this part of the tutorial, you import the contents of the ProjectTemplate and Utilities
repositories managed by the Microsoft TDSP team into your GroupProjectTemplate and
GroupUtilities repositories.
1. From the GroupCommon project home page, select Repos in the left navigation.
The default GroupProjectTemplate repo opens.
4. At the top of the Repos page, drop down and select the GroupUtilities repository.
Each of your two group repositories now contains all the files, except those in the .git
directory, from the Microsoft TDSP team's corresponding repository.
Customize the contents of the group
repositories
If you want to customize the contents of your group repositories to meet the specific
needs of your group, you can do that now. You can modify the files, change the
directory structure, or add files that your group has developed or that are helpful for
your group.
2. At the top of the page, select the repository you want to customize.
3. In the repo directory structure, navigate to the folder or file you want to change.
1. On the GroupCommon project Summary page, select Repos, and at the top of the
page, select the repository you want to clone.
2. On the repo page, select Clone at upper right.
3. In the Clone repository dialog, select HTTPS for an HTTP connection, or SSH for
an SSH connection, and copy the clone URL under Command line to your
clipboard.
HTTPS connection:
Bash
git clone
https://DataScienceUnit@dev.azure.com/DataScienceUnit/GroupCommon/_git/
GroupUtilities
SSH connection:
Bash
git clone
git@ssh.dev.azure.com:v3/DataScienceUnit/GroupCommon/GroupUtilities
After making whatever changes you want in the local clone of your repository, you can
push the changes to the shared group common repositories.
Run the following Git Bash commands from your local GroupProjectTemplate or
GroupUtilities directory.
Bash
git add .
git commit -m "push from local"
git push
7 Note
If this is the first time you commit to a Git repository, you may need to configure
global parameters user.name and user.email before you run the git commit
command. Run the following two commands:
If you're committing to several Git repositories, use the same name and email
address for all of them. Using the same name and email address is convenient
when building Power BI dashboards to track your Git activities in multiple
repositories.
1. In Azure DevOps, from the GroupCommon project home page, select Project
settings from the left navigation.
2. From the Project Settings left navigation, select Teams, then on the Teams page,
select the GroupCommon Team.
3. On the Team Profile page, select Add.
4. In the Add users and groups dialog, search for and select members to add to the
group, and then select Save changes.
To configure permissions for members:
2. On the Permissions page, select the group you want to add members to.
3. On the page for that group, select Members, and then select Add.
4. In the Invite members popup, search for and select members to add to the group,
and then select Save.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Here are links to detailed descriptions of the other roles and tasks in the Team Data
Science Process:
This article describes the tasks that a team lead completes for their data science team.
The team lead's objective is to establish a collaborative team environment that
standardizes on the Team Data Science Process (TDSP). The TDSP is designed to help
improve collaboration and team learning.
The TDSP is an agile, iterative data science methodology to efficiently deliver predictive
analytics solutions and intelligent applications. The process distills the best practices and
structures from Microsoft and the industry. The goal is successful implementation of
data science initiatives and fully realizing the benefits of their analytics programs. For an
outline of the personnel roles and associated tasks for a data science team
standardizing on the TDSP, see Team Data Science Process roles and tasks.
A team lead manages a team consisting of several data scientists in the data science unit
of an enterprise. Depending on the data science unit's size and structure, the group
manager and the team lead might be the same person, or they could delegate their
tasks to surrogates. But the tasks themselves do not change.
The following diagram shows the workflow for the tasks the team lead completes to set
up a team environment:
7 Note
This article uses Azure DevOps and a DSVM to set up a TDSP team environment,
because that is how to implement TDSP at Microsoft. If your team uses other code
hosting or development platforms, the team lead tasks are the same, but the way
to complete them may be different.
Prerequisites
This tutorial assumes that the following resources and permissions have been set up by
your group manager:
To be able to clone repositories and modify their content on your local machine or
DSVM, or set up Azure file storage and mount it to your DSVM, you need the following:
An Azure subscription.
Git installed on your machine. If you're using a DSVM, Git is pre-installed.
Otherwise, see the Platforms and tools appendix.
If you want to use a DSVM, the Windows or Linux DSVM created and configured in
Azure. For more information and instructions, see the Data Science Virtual Machine
Documentation.
For a Windows DSVM, Git Credential Manager (GCM) installed on your machine.
In the README.md file, scroll down to the Download and Install section and select
the latest installer. Download the .exe installer from the installer page and run it.
For a Linux DSVM, an SSH public key set up on your DSVM and added in Azure
DevOps. For more information and instructions, see the Create SSH public key
section in the Platforms and tools appendix.
The names specified for the repositories and directories in this tutorial assume that you
want to establish a separate project for your own team within your larger data science
organization. However, the entire group can choose to work under a single project
created by the group manager or organization administrator. Then, all the data science
teams create repositories under this single project. This scenario might be valid for:
A small data science group that doesn't have multiple data science teams.
A larger data science group with multiple data science teams that nevertheless
wants to optimize inter-team collaboration with activities such as group-level
sprint planning.
If teams choose to have their team-specific repositories under a single group project,
the team leads should create the repositories with names like <TeamName>Template
and <TeamName>Utilities. For instance: TeamATemplate and TeamAUtilities.
In any case, team leads need to let their team members know which template and
utilities repositories to set up and clone. Project leads should follow the project lead
tasks for a data science team to create project repositories, whether under separate
projects or a single project.
1. In your web browser, go to your group's Azure DevOps organization home page at
URL https://<server name>/<organization name>, and select New project.
2. In the Create project dialog, enter your team name, such as MyTeam, under
Project name, and then select Advanced.
3. Under Version control, select Git, and under Work item process, select Agile. Then
select Create.
The team project Summary page opens, with page URL https://<server
name>/<organization name>/<team name>.
3. On the Project Settings page, select the ... next to the MyTeam repository, and
then select Rename repository.
4. In the Rename the MyTeam repository popup, enter TeamUtilities, and then select
Rename.
Or, select Repos from the left navigation of the MyTeam project Summary page,
select a repository at the top of the page, and then select New repository from the
dropdown.
2. In the Create a new repository dialog, make sure Git is selected under Type. Enter
TeamTemplate under Repository name, and then select Create.
3. Confirm that you can see the two repositories TeamUtilities and TeamTemplate on
your project settings page.
3. In the Import a Git repository dialog, select Git as the Source type, and enter the
URL for your group common template repository under Clone URL. The URL is
https://<server name>/<organization name>/_git/<repository name>. For example:
https://dev.azure.com/DataScienceUnit/GroupCommon/_git/GroupProjectTemplate.
4. Select Import. The contents of your group template repository are imported into
your team template repository.
5. At the top of your project's Repos page, drop down and select the TeamUtilities
repository.
6. Repeat the import process to import the contents of your group common utilities
repository, for example GroupUtilities, into your TeamUtilities repository.
Each of your two team repositories now contains the files from the corresponding group
common repository.
2. At the top of the page, select the repository you want to customize.
3. In the repo directory structure, navigate to the folder or file you want to change.
To edit existing files, navigate to the file and then select Edit.
4. After adding or editing files, select Commit.
To work with repositories on your local machine or DSVM, you first copy or clone the
repositories to your local machine, and then commit and push your changes up to the
shared team repositories,
To clone repositories:
1. On the MyTeam project Summary page, select Repos, and at the top of the page,
select the repository you want to clone.
3. In the Clone repository dialog, under Command line, select HTTPS for an HTTP
connection or SSH for an SSH connection, and copy the clone URL to your
clipboard.
4. On your local machine, create the following directories:
6. In Git Bash, run the command git clone <clone URL> , where <clone URL> is the
URL you copied from the Clone dialog.
For example, use one of the following commands to clone the TeamUtilities
repository to the MyTeam directory on your local machine.
HTTPS connection:
Bash
git clone
https://DataScienceUnit@dev.azure.com/DataScienceUnit/MyTeam/_git/TeamU
tilities
SSH connection:
Bash
After making whatever changes you want in the local clone of your repository, commit
and push the changes to the shared team repositories.
Run the following Git Bash commands from your local
GitRepos\MyTeam\TeamTemplate or GitRepos\MyTeam\TeamUtilities directory.
Bash
git add .
git commit -m "push from local"
git push
7 Note
If this is the first time you commit to a Git repository, you may need to configure
global parameters user.name and user.email before you run the git commit
command. Run the following two commands:
If you're committing to several Git repositories, use the same name and email
address for all of them. Using the same name and email address is convenient
when building Power BI dashboards to track your Git activities in multiple
repositories.
1. In Azure DevOps, from the MyTeam project home page, select Project settings
from the left navigation.
2. From the Project Settings left navigation, select Teams, then on the Teams page,
select the MyTeam Team.
3. On the Team Profile page, select Add.
4. In the Add users and groups dialog, search for and select members to add to the
group, and then select Save changes.
2. On the Permissions page, select the group you want to add members to.
3. On the page for that group, select Members, and then select Add.
4. In the Invite members popup, search for and select members to add to the group,
and then select Save.
For information about sharing other resources with your team, such as Azure HDInsight
Spark clusters, see Platforms and tools. That topic provides guidance from a data
science perspective on selecting resources that are appropriate for your needs, and links
to product pages and other relevant and useful tutorials.
7 Note
To avoid transmitting data across data centers, which might be slow and costly,
make sure that your Azure resource group, storage account, and DSVM are all
hosted in the same Azure region.
PowerShell
wget "https://raw.githubusercontent.com/Azure/Azure-
MachineLearning-DataScience/master/Misc/TDSP/CreateFileShare.ps1"
-outfile "CreateFileShare.ps1"
.\CreateFileShare.ps1
shell
wget "https://raw.githubusercontent.com/Azure/Azure-
MachineLearning-DataScience/master/Misc/TDSP/CreateFileShare.sh"
bash CreateFileShare.sh
2. Log in to your Microsoft Azure account when prompted, and select the
subscription you want to use.
3. Select the storage account to use, or create a new one under your selected
subscription. You can use lowercase characters, numbers, and hyphens for the
Azure file storage name.
4. To facilitate mounting and sharing the storage, press Enter or enter Y to save the
Azure file storage information into a text file in your current directory. You can
check in this text file to your TeamTemplate repository, ideally under
Docs\DataDictionaries, so all projects in your team can access it. You also need the
file information to mount your Azure file storage to your Azure DSVM in the next
section.
PowerShell
wget "https://raw.githubusercontent.com/Azure/Azure-
MachineLearning-DataScience/master/Misc/TDSP/AttachFileShare.ps1"
-outfile "AttachFileShare.ps1"
.\AttachFileShare.ps1
wget "https://raw.githubusercontent.com/Azure/Azure-
MachineLearning-DataScience/master/Misc/TDSP/AttachFileShare.sh"
bash AttachFileShare.sh
2. Press Enter or enter Y to continue, if you saved an Azure file storage information
file in the previous step. Enter the complete path and name of the file you created.
If you don't have an Azure file storage information file, enter n, and follow the
instructions to enter your subscription, Azure storage account, and Azure file
storage information.
3. Enter the name of a local or TDSP drive to mount the file share on. The screen
displays a list of existing drive names. Provide a drive name that doesn't already
exist.
4. Confirm that the new drive and storage is successfully mounted on your machine.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Here are links to detailed descriptions of the other roles and tasks defined by the Team
Data Science Process:
This article describes tasks that a project lead completes to set up a repository for their
project team in the Team Data Science Process (TDSP). The TDSP is a framework
developed by Microsoft that provides a structured sequence of activities to efficiently
execute cloud-based, predictive analytics solutions. The TDSP is designed to help
improve collaboration and team learning. For an outline of the personnel roles and
associated tasks for a data science team standardizing on the TDSP, see Team Data
Science Process roles and tasks.
A project lead manages the daily activities of individual data scientists on a specific data
science project in the TDSP. The following diagram shows the workflow for project lead
tasks:
This tutorial covers Step 1: Create project repository, and Step 2: Seed project repository
from your team ProjectTemplate repository.
For Step 3: Create Feature work item for project, and Step 4: Add Stories for project
phases, see Agile development of data science projects.
For Step 5: Create and customize storage/analysis assets and share, if necessary, see
Create team data and analytics resources.
For Step 6: Set up security control of project repository, see Add team members and
configure permissions.
7 Note
This article uses Azure Repos to set up a TDSP project, because that is how to
implement TDSP at Microsoft. If your team uses another code hosting platform, the
project lead tasks are the same, but the way to complete them may be different.
Prerequisites
This tutorial assumes that your group manager and team lead have set up the following
resources and permissions:
To clone repositories and modify content on your local machine or Data Science Virtual
Machine (DSVM), or set up Azure file storage and mount it to your DSVM, you also need
to consider this checklist:
An Azure subscription.
Git installed on your machine. If you're using a DSVM, Git is pre-installed.
Otherwise, see the Platforms and tools appendix.
If you want to use a DSVM, the Windows or Linux DSVM created and configured in
Azure. For more information and instructions, see the Data Science Virtual Machine
Documentation.
For a Windows DSVM, Git Credential Manager (GCM) installed on your machine.
In the README.md file, scroll down to the Download and Install section and select
the latest installer. Download the .exe installer from the installer page and run it.
For a Linux DSVM, an SSH public key set up on your DSVM and added in Azure
DevOps. For more information and instructions, see the Create SSH public key
section in the Platforms and tools appendix.
2. Select the repository name at the top of the page, and then select New repository
from the dropdown.
3. In the Create a new repository dialog, make sure Git is selected under Type. Enter
DSProject1 under Repository name, and then select Create.
4. Confirm that you can see the new DSProject1 repository on your project settings
page.
Import the team template into your project
repository
To populate your project repository with the contents of your team template repository:
1. From your team's project Summary page, select Repos in the left navigation.
2. Select the repository name at the top of the page, and select DSProject1 from the
dropdown.
5. Select Import. The contents of your team template repository are imported into
your project repository.
If you need to customize the contents of your project repository to meet your project's
specific needs, you can add, delete, or modify repository files and folders. You can work
directly in Azure Repos, or clone the repository to your local machine or DSVM, make
changes, and commit and push your updates to the shared project repository. Follow
the instructions at Customize the contents of the team repositories.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Here are links to detailed descriptions of the other roles and tasks defined by the Team
Data Science Process:
This topic outlines the tasks that an individual contributor completes to set up a project
in the Team Data Science Process (TDSP). The objective is to work in a collaborative
team environment that standardizes on the TDSP. The TDSP is designed to help improve
collaboration and team learning. For an outline of the personnel roles and their
associated tasks that are handled by a data science team standardizing on the TDSP, see
Team Data Science Process roles and tasks.
The following diagram shows the tasks that project individual contributors (data
scientists) complete to set up their team environment. For instructions on how to
execute a data science project under the TDSP, see Execution of data science projects.
7 Note
This article uses Azure Repos and a Data Science Virtual Machine (DSVM) to set up
a TDSP environment, because that is how to implement TDSP at Microsoft. If your
team uses other code hosting or development platforms, the individual contributor
tasks are the same, but the way to complete them may be different.
Prerequisites
This tutorial assumes that the following resources and permissions have been set up by
your group manager, team lead, and project lead:
To clone repositories and modify content on your local machine or DSVM, or mount
Azure file storage to your DSVM, you need to consider this checklist:
An Azure subscription.
Git installed on your machine. If you're using a DSVM, Git is pre-installed.
Otherwise, see the Platforms and tools appendix.
If you want to use a DSVM, the Windows or Linux DSVM created and configured in
Azure. For more information and instructions, see the Data Science Virtual Machine
Documentation.
For a Windows DSVM, Git Credential Manager (GCM) installed on your machine.
In the README.md file, scroll down to the Download and Install section and select
the latest installer. Download the .exe installer from the installer page and run it.
For a Linux DSVM, an SSH public key set up on your DSVM and added in Azure
DevOps. For more information and instructions, see the Create SSH public key
section in the Platforms and tools appendix.
The Azure file storage information for any Azure file storage you need to mount to
your DSVM.
Clone repositories
To work with repositories locally and push your changes up to the shared team and
project repositories, you first copy or clone the repositories to your local machine.
2. Select Repos in the left navigation, and at the top of the page, select the repository
you want to clone.
7. In Git Bash, run the command git clone <clone URL> for each repository you want
to clone.
For example, the following command clones the TeamUtilities repository to the
MyTeam directory on your local machine.
HTTPS connection:
Bash
git clone
https://DataScienceUnit@dev.azure.com/DataScienceUnit/MyTeam/_git/TeamU
tilities
SSH connection:
Bash
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Here are links to detailed descriptions of the other roles and tasks defined by the Team
Data Science Process:
This document describes how developers can execute a data science project in a
systematic, version controlled, and collaborative way within a project team by using the
Team Data Science Process (TDSP). The TDSP is a framework developed by Microsoft
that provides a structured sequence of activities to efficiently execute cloud-based,
predictive analytics solutions. For an outline of the roles and tasks that are handled by a
data science team standardizing on the TDSP, see Team Data Science Process roles and
tasks.
The following instructions outline the steps needed to set up a TDSP team environment
using Azure Boards and Azure Repos in Azure DevOps. The instructions use Azure
DevOps because that is how to implement TDSP at Microsoft. If your group uses a
different code hosting platform, the team lead tasks generally don't change, but the way
to complete the tasks is different. For example, linking a work item with a Git branch
might not be the same with GitHub as it is with Azure Repos.
The following figure illustrates a typical sprint planning, coding, and source-control
workflow for a data science project:
Work item types
In the TDSP sprint planning framework, there are four frequently used work item types:
Features, User Stories, Tasks, and Bugs. The backlog for all work items is at the project
level, not the Git repository level.
User Story: User Stories are work items needed to complete a Feature end-to-end.
Examples of User Stories include:
Get data
Explore data
Generate features
Build models
Operationalize models
Retrain models
Task: Tasks are assignable work items that need to be done to complete a specific
User Story. For example, Tasks in the User Story Get data could be:
Get SQL Server credentials
Upload data to Azure Synapse Analytics
Bug: Bugs are issues in existing code or documents that must be fixed to complete
a Task. If Bugs are caused by missing work items, they can escalate to be User
Stories or Tasks.
Data scientists may feel more comfortable using an agile template that replaces
Features, User Stories, and Tasks with TDSP lifecycle stages and substages. To create an
agile-derived template that specifically aligns with the TDSP lifecycle stages, see Use an
agile TDSP work template.
7 Note
TDSP borrows the concepts of Features, User Stories, Tasks, and Bugs from software
code management (SCM). The TDSP concepts might differ slightly from their
conventional SCM definitions.
Plan sprints
Many data scientists are engaged with multiple projects, which can take months to
complete and proceed at different paces. Sprint planning is useful for project
prioritization, and resource planning and allocation. In Azure Boards, you can easily
create, manage, and track work items for your projects, and conduct sprint planning to
ensure projects are moving forward as expected.
For more information about sprint planning in Azure Boards, see Assign backlog items
to a sprint.
1. From your project page, select Boards > Backlogs in the left navigation.
2. On the Backlog tab, if the work item type in the top bar is Stories, drop down and
select Features. Then select New Work Item.
3. Enter a title for the Feature, usually your project name, and then select Add to top.
4. From the Backlog list, select and open the new Feature. Fill in the description,
assign a team member, and set planning parameters.
You can also link the Feature to the project's Azure Repos code repository by
selecting Add link under the Development section.
You can also link the User Story to a branch of the project's Azure Repos code
repository by selecting Add link under the Development section. Select the
repository and branch you want to link the work item to, and then select OK.
3. When you're finished editing the User Story, select Save & Close.
To add a Task to a User Story, select the + next to the User Story item, and select Task.
Fill in the title and other information in the Task.
After you create Features, User Stories, and Tasks, you can view them in the Backlogs or
Boards views to track their status.
3. In the All processes pane, select the ... next to Agile, and then select Create
inherited process.
4. In the Create inherited process from Agile dialog, enter the name
AgileDataScienceProcess, and select Create process.
5. In All processes, select the new AgileDataScienceProcess.
6. On the Work item types tab, disable Epic, Feature, User Story, and Task by
selecting the ... next to each item and then selecting Disable.
7. In All processes, select the Backlog levels tab. Under Portfolios backlogs, select
the ... next to Epic (disabled), and then select Edit/Rename.
9. Follow the same steps to rename Features to TDSP Stages, and add the following
new work item types:
Business Understanding
Data Acquisition
Modeling
Deployment
10. Under Requirement backlog, rename Stories to TDSP Substages, add the new work
item type TDSP Substage, and set the default work item type to TDSP Substage.
11. Under Iteration backlog, add a new work item type TDSP Task, and set it to be the
default work item type.
After you complete the steps, the backlog levels should look like this:
1. From your Azure DevOps organization main page, select New project.
2. In the Create new project dialog, give your project a name, and then select
Advanced.
3. Under Work item process, drop down and select AgileDataScienceProcess, and
then select Create.
4. In the newly created project, select Boards > Backlogs in the left navigation.
5. To make TDSP Projects visible, select the Configure team settings icon. In the
Settings screen, select the TDSP Projects check box, and then select Save and
close.
6. To create a data science-specific TDSP Project, select TDSP Projects in the top bar,
and then select New work item.
7. In the popup, give the TDSP Project work item a name, and select Add to top.
8. To add a work item under the TDSP Project, select the + next to the project, and
then select the type of work item to create.
9. Fill in the details in the new work item, and select Save & Close.
10. Continue to select the + symbols next to work items to add new TDSP Stages,
Substages, and Tasks.
Here is an example of how the data science project work items should appear in
Backlogs view:
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Collaborative coding with Git describes how to do collaborative code development
for data science projects using Git as the shared code development framework,
and how to link these coding activities to the work planned with the agile process.
Agile process
Agile process work item types and workflow
Related resources
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Machine learning at scale
Collaborative coding with Git
Article • 11/15/2022
This article describes how to use Git as the collaborative code development framework
for data science projects. The article covers how to link code in Azure Repos to agile
development work items in Azure Boards, how to do code reviews, and how to create
and merge pull requests for changes.
To connect a work item to a new branch, select the Actions ellipsis (...) next to the work
item, and on the context menu, scroll to and select New branch.
In the Create a branch dialog, provide the new branch name and the base Azure Repos
Git repository and branch. The base repository must be in the same Azure DevOps
project as the work item. The base branch can be any existing branch. Select Create
branch.
You can also create a new branch using the following Git bash command in Windows or
Linux:
Bash
If you don't specify a <base branch name>, the new branch is based on main .
Bash
After you switch to the working branch, you can start developing code or
documentation artifacts to complete the work item. Running git checkout main
switches you back to the main branch.
It's a good practice to create a Git branch for each User Story work item. Then, for each
Task work item, you can create a branch based on the User Story branch. Organize the
branches in a hierarchy that corresponds to the User Story-Task relationship when you
have multiple people working on different User Stories for the same project, or on
different Tasks for the same User Story. You can minimize conflicts by having each team
member work on a different branch, or on different code or other artifacts when sharing
a branch.
The following diagram shows the recommended branching strategy for TDSP. You might
not need as many branches as shown here, especially when only one or two people
work on a project, or only one person works on all Tasks of a User Story. But separating
the development branch from the primary branch is always a good practice, and can
help prevent the release branch from being interrupted by development activities. For a
complete description of the Git branch model, see A Successful Git Branching Model .
You can also link a work item to an existing branch. On the Detail page of a work item,
select Add link. Then select an existing branch to link the work item to, and select OK.
Bash
git status
git add .
git commit -m "added an R script file"
git push origin script
Create a pull request
After one or more commits and pushes, when you're ready to merge your current
working branch into its base branch, you can create and submit a pull request in Azure
Repos.
From the main page of your Azure DevOps project, point to Repos > Pull requests in
the left navigation. Then select either of the New pull request buttons, or the Create a
pull request link.
On the New Pull Request screen, if necessary, navigate to the Git repository and branch
you want to merge your changes into. Add or change any other information you want.
Under Reviewers, add the names of the reviewers, and then select Create.
When you go back to Repos in the left navigation, you can see that you've been
switched to the main branch since the script branch was deleted.
You can also use the following Git bash commands to merge the script working branch
to its base branch and delete the working branch after merging:
Bash
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Execute data science tasks shows how to use utilities to complete several common data
science tasks, such as interactive data exploration, data analysis, reporting, and model
creation.
Related resources
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Machine learning at scale
Execute data science tasks: exploration,
modeling, and deployment
Article • 11/15/2022
Typical data science tasks include data exploration, modeling, and deployment. This
article outlines the tasks to complete several common data science tasks such as
interactive data exploration, data analysis, reporting, and model creation. Options for
deploying a model into a production environment may include:
1. Exploration
A data scientist can perform exploration and reporting in a variety of ways: by using
libraries and packages available for Python (matplotlib for example) or with R (ggplot or
lattice for example). Data scientists can customize such code to fit the needs of data
exploration for specific scenarios. The needs for dealing with structured data are
different that for unstructured data such as text or images.
Products such as Azure Machine Learning also provide advanced data preparation for
data wrangling and exploration, including feature creation. The user should decide on
the tools, libraries, and packages that best suite their needs.
The deliverable at the end of this phase is a data exploration report. The report should
provide a fairly comprehensive view of the data to be used for modeling and an
assessment of whether the data is suitable to proceed to the modeling step.
2. Modeling
There are numerous toolkits and packages for training models in a variety of languages.
Data scientists should feel free to use which ever ones they are comfortable with, as
long as performance considerations regarding accuracy and latency are satisfied for the
relevant business use cases and production scenarios.
Model management
After multiple models have been built, you usually need to have a system for registering
and managing the models. Typically you need a combination of scripts or APIs and a
backend database or versioning system. Azure Machine Learning provides deployment
of ONNX models or deployment of ML Flow models.
3. Deployment
Production deployment enables a model to play an active role in a business. Predictions
from a deployed model can be used for business decisions.
Production platforms
There are various approaches and platforms to put models into production. We
recommend deployment to Azure Machine Learning.
7 Note
Prior to deployment, one has to ensure the latency of model scoring is low enough
to use in production.
A/B testing
When multiple models are in production, it can be useful to perform A/B testing to
compare performance of the models.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Track progress of data science projects shows how a data scientist can track the
progress of a data science project.
Model operation and CI/CD shows how CI/CD can be performed with developed
models.
Test data science code with Azure
DevOps
Article • 11/15/2022
This article gives preliminary guidelines for testing code in a data science workflow,
using Azure DevOps. Such testing gives data scientists a systematic and efficient way to
check the quality and expected outcome of their code. We use a Team Data Science
Process (TDSP) project that uses the UCI Adult Income dataset that we published
earlier to show how code testing can be done.
Data preparation
Data quality examination
Modeling
Model deployment
This article replaces the term "unit testing" with "code testing." It refers to testing as the
functions that help to assess if code for a certain step of a data science lifecycle is
producing results "as expected." The person who's writing the test defines what's "as
expected," depending on the outcome of the function--for example, data quality check
or modeling.
After you create your project, you'll find it in Solution Explorer in the right pane:
2. Feed your project code into the Azure DevOps project code repository:
3. Suppose you've done some data preparation work, such as data ingestion, feature
engineering, and creating label columns. You want to make sure your code is
generating the results that you expect. Here's some code that you can use to test
whether the data-processing code is working properly:
4. After you've done the data processing and feature engineering work, and you've
trained a good model, make sure that the model you trained can score new
datasets correctly. You can use the following two tests to check the prediction
levels and distribution of label values:
Check prediction levels:
5. Put all test functions together into a Python script called test_funcs.py:
6. After the test codes are prepared, you can set up the testing environment in Visual
Studio.
Create a Python file called test1.py. In this file, create a class that includes all the
tests you want to do. The following example shows six tests prepared:
1. Those tests can be automatically discovered if you put codetest.testCase after your
class name. Open Test Explorer in the right pane, and select Run All. All the tests
will run sequentially and will tell you if the test is successful or not.
2. Check in your code to the project repository by using Git commands. Your most
recent work will be reflected shortly in Azure DevOps.
3. Set up automatic build and test in Azure DevOps:
a. In the project repository, select Build and Release, and then select +New to
create a new build process.
b. Follow the prompts to select your source code location, project name,
repository, and branch information.
d. Name the build and select the agent. You can choose the default here if you
want to use a DSVM to complete the build process. For more information about
setting agents, see Build and release agents.
e. Select + in the left pane, to add a task for this build phase. Because we're going
to run the Python script test1.py to complete all the checks, this task is using a
PowerShell command to run Python code.
f. In the PowerShell details, fill in the required information, such as the name and
version of PowerShell. Choose Inline Script as the type.
In the box under Inline Script, you can type python test1.py. Make sure the
environment variable is set up correctly for Python. If you need a different version
or kernel of Python, you can explicitly specify the path as shown in the figure:
g. Select Save & queue to complete the build pipeline process.
Now every time a new commit is pushed to the code repository, the build process will
start automatically. You can define any branch. The process runs the test1.py file in the
agent machine to make sure that everything defined in the code runs correctly.
If alerts are set up correctly, you'll be notified in email when the build is finished. You
can also check the build status in Azure DevOps. If it fails, you can check the details of
the build and find out which piece is broken.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
See the UCI income prediction repository for concrete examples of unit tests for
data science scenarios.
Follow the preceding outline and examples from the UCI income prediction
scenario in your own data science projects.
References
Team Data Science Process
Visual Studio Testing Tools
Azure DevOps Testing Resources
Data Science Virtual Machines
Track the progress of data science
projects
Article • 08/31/2023
Data science group managers, team leads, and project leads can track the progress of
their projects. Managers want to know what work has been done, who did the work, and
what work remains. Managing expectations is an important element of success.
For instructions on how to create and customize dashboards and widgets in Azure
DevOps, see the following quickstarts:
Example dashboard
Here is a simple example dashboard that tracks the sprint activities of an Agile data
science project, including the number of commits to associated repositories.
The countdown tile shows the number of days that remain in the current sprint.
The two code tiles show the number of commits in the two project repositories for
the past seven days.
Work items for TDSP Customer Project shows the results of a query for all work
items and their status.
A cumulative flow diagram (CFD) shows the number of Closed and Active work
items.
The burndown chart shows work still to complete against remaining time in the
sprint.
The burnup chart shows completed work compared to total amount of work in the
sprint.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Create CI/CD pipelines for AI apps
using Azure Pipelines, Docker, and
Kubernetes
Azure Azure Pipelines Azure Machine Learning Azure Kubernetes Service (AKS)
7 Note
The following process is one of several ways to do CI/CD. There are alternatives to
this tooling and the prerequisites.
To use the downloaded source code and tutorial, you need the following prerequisites:
Principal author:
Next steps
Team Data Science Process (TDSP)
Azure Machine Learning (AML)
Azure DevOps
Azure Kubernetes Services (AKS)
Related resources
What are Azure Machine Learning pipelines?
Compare Microsoft machine learning products and technologies
Machine learning operations (MLOps) v2
Team Data Science Process for data
scientists
Article • 10/09/2023
This article provides guidance to a set of objectives that are typically used to implement
comprehensive data science solutions with Azure technologies. You are guided through:
These training materials are related to the Team Data Science Process (TDSP) and
Microsoft and open-source software and toolkits, which are helpful for envisioning,
executing and delivering data science solutions.
Lesson Path
You can use the items in the following table to guide your own self-study. Read the
Description column to follow the path, click on the Topic links for study references, and
check your skills using the Knowledge Check column.
Setup and Microsoft Azure Now let's create an account If you do not have an Azure
Configure your in Microsoft Azure for Account, create one . Log in
training, training and learn how to to the Microsoft Azure portal
development, create development and and create one Resource
and test environments. These Group for training.
Objective Topic Description Knowledge Check
The Microsoft There are multiple ways of Set your default subscription
Azure working with Microsoft with the Azure CLI.
Command-Line Azure – from graphical tools
Interface (CLI) like VSCode and Visual
Studio, to Web interfaces
such as the Azure portal,
and from the command line,
such as Azure PowerShell
commands and functions. In
this article, we cover the
Command-Line Interface
(CLI), which you can use
locally on your workstation,
in Windows and other
Operating Systems, as well
as in the Azure portal.
Microsoft Entra Microsoft Entra ID forms the Add one user to Microsoft
ID basis of securing your Entra ID. NOTE: You may not
application. In this article, have permissions for this
you learn more about action if you are not the
accounts, rights, and administrator for the
permissions. Active subscription. If that's the
Directory and security are case, simply review this
complex topics, so just read tutorial to learn more.
through this resource to
understand the
fundamentals.
The Microsoft You can install the tools for Create a Data Science Virtual
Azure Data working with Data Science Machine and work through at
Science Virtual locally on multiple least one lab.
Machine operating systems. But the
Objective Topic Description Knowledge Check
Install and Working with To follow our DevOps Clone this GitHub project for
Understand git process with the TDSP, we your learning path project
the tools and need to have a version- structure .
technologies control system. Microsoft
for working Azure Machine Learning
with Data uses git, a popular open-
Science source distributed
solutions repository system. In this
article, you learn more
about how to install,
configure, and work with git
and a central repository –
GitHub.
Working with Notebooks are a way of Open this page , and click
Notebooks introducing text and code in on the "Welcome to
the same document. Azure Python.ipynb" link. Work
Machine Learning work with through the examples on that
Notebooks, so it is page.
beneficial to understand
how to use them. Read
through this tutorial and
give it a try in the
Knowledge Check section.
scikit-learn The scikit-learn set of tools Using the Iris dataset, persist
allows you to perform data an SVM model using Pickle.
science tasks in Python. We
use this framework in our
solution. This article covers
the basics and explains
where you can learn more.
Create a Data Determining the With the development Locate a resource on "The 5
Processing Question, environment installed and data science questions" and
Flow from following the configured, and the describe one question your
Business TDSP understanding of the organization might have in
Requirements technologies and processes these areas. Which
in place, it's time to put algorithms should you focus
everything together using on for that question?
the TDSP to perform an
analysis. We need to start by
defining the question,
selecting the data sources,
and the rest of the steps in
the Team Data Science
Process. Keep in mind the
DevOps process as we work
through this process. In this
article, you learn how to
take the requirements from
your organization and
create a data flow map
through your application to
define your solution using
the Team Data Science
Process
Monitor your Application There are multiple tools you Set up Application Insights to
Solution Insights can use to monitor your end monitor an Application .
solution. Azure Application
Insights makes it easy to
integrate built-in
monitoring into your
solution.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
See Team Data Science Process for Developer Operations. This article explores the
Developer Operations (DevOps) functions that are specific to an Advanced Analytics and
Cognitive Services solution implementation.
Related resources
Execute data science tasks: exploration, modeling, and deployment
Set up data science environments for use in the Team Data Science Process
Platforms and tools for data science projects
Data science and machine learning with Azure Databricks
What is the Team Data Science Process?
Team Data Science Process for
Developer Operations
Article • 06/21/2023
This article explores the Developer Operations (DevOps) functions that are specific to an
Advanced Analytics and Cognitive Services solution implementation. These training
materials implement the Team Data Science Process (TDSP) and Microsoft and open-
source software and toolkits, helpful for envisioning, executing and delivering data
science solutions. It references topics that cover the DevOps Toolchain that is specific to
Data Science and AI projects and solutions.
Lesson Path
The following table provides level-based guidance to help complete the DevOps
objectives for implementing data science solutions on Azure.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
See Team Data Science Process for data scientists. This article provides guidance for
implementing data science solutions with Azure.
Related resources
Execute data science tasks: exploration, modeling, and deployment
Set up data science environments for use in the Team Data Science Process
Platforms and tools for data science projects
Data science and machine learning with Azure Databricks
What is the Team Data Science Process?
Set up data science environments for
use in the Team Data Science Process
Article • 11/15/2022
The Team Data Science Process uses various data science environments for the storage,
processing, and analysis of data. They include Azure Blob Storage, several types of Azure
virtual machines, HDInsight (Hadoop) clusters, and Machine Learning workspaces. The
decision about which environment to use depends on the type and quantity of data to
be modeled and the target destination for that data in the cloud.
See Quickstart: Create workspace resources you need to get started with Azure
Machine Learning.
The Microsoft Data Science Virtual Machine (DSVM) is also available as an Azure virtual
machine (VM) image. This VM is pre-installed and configured with several popular tools
that are commonly used for data analytics and machine learning. The DSVM is available
on both Windows and Linux. For more information, see Introduction to the cloud-based
Data Science Virtual Machine for Linux and Windows.
Windows DSVM
Ubuntu DSVM
CentOS DSVM
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Microsoft provides a full spectrum of analytics resources for both cloud or on-premises
platforms. They can be deployed to make the execution of your data science projects
efficient and scalable. Guidance for teams implementing data science projects in a
trackable, version controlled, and collaborative way is provided by the Team Data
Science Process (TDSP). See Team Data Science Process roles and tasks, for an outline of
the personnel roles, and their associated tasks that are handled by a data science team
standardizing on this process.
The main recommended Azure resource for TDSP is Azure Machine Learning. Examples
in Azure Architecture Center sometimes show Azure Machine Learning used with other
Azure resources. These other analytics resources available to data science teams using
the TDSP include:
In this document, we briefly describe the resources and provide links to the tutorials and
walkthroughs the TDSP teams have published. The articles will show you how to these
resources step by step to build your intelligent applications. More information on these
resources is available on their product pages.
It also includes ML and AI tools like xgboost, mxnet, and Vowpal Wabbit.
Currently DSVM is available in Windows and Linux CentOS operating systems. Choose
the size of your DSVM (number of CPU cores and the amount of memory) based on the
needs of the data science projects that you plan to execute on it.
For more information on Windows edition of DSVM, see Microsoft Data Science Virtual
Machine on the Azure Marketplace. For the Linux edition of the DSVM, see Linux Data
Science Virtual Machine .
To learn how to execute some of the common data science tasks on the DSVM
efficiently, see 10 things you can do on the Data science Virtual Machine
When you create a Spark cluster in HDInsight, you create Azure compute resources with
Spark installed and configured. It takes about 10 minutes to create a Spark cluster in
HDInsight. Store the data to be processed in Azure Blob storage. For information on
using Azure Blob Storage with a cluster, see Use HDFS-compatible Azure Blob storage
with Hadoop in HDInsight.
TDSP team from Microsoft has published two end-to-end walkthroughs on how to use
Azure HDInsight Spark Clusters to build data science solutions, one using Python and
the other Scala. For more information on Azure HDInsight Spark Clusters, see Overview:
Apache Spark on HDInsight Linux. To learn how to build a data science solution using
Python on an Azure HDInsight Spark Cluster, see Overview of Data Science using Spark
on Azure HDInsight. To learn how to build a data science solution using Scala on an
Azure HDInsight Spark Cluster, see Data Science using Scala and Spark on Azure.
Azure Synapse Analytics
Azure Synapse Analytics allows you to scale compute resources easily and in seconds,
without over-provisioning or over-paying. It also offers the unique option to pause the
use of compute resources, giving you the freedom to better manage your cloud costs.
The ability to deploy scalable compute resources makes it possible to bring all your data
into Azure Synapse Analytics. Storage costs are minimal and you can run compute only
on the parts of datasets that you want to analyze.
For more information on Azure Synapse Analytics, see the Azure Synapse Analytics
website. To learn how to build end-to-end advanced analytics solutions with Azure
Synapse Analytics, see The Team Data Science Process in action: using Azure Synapse
Analytics.
For more information on Azure Data Lake, see Introducing Azure Data Lake . To learn
how to build a scalable end-to-end data science solution with Azure Data Lake, see
Scalable Data Science in Azure Data Lake: An end-to-end Walkthrough
Hive allows you to project structure on largely unstructured data. After you define the
structure, you can use Hive to query that data in a Hadoop cluster without having to
use, or even know, Java or MapReduce. HiveQL (the Hive query language) allows you to
write queries with statements that are similar to T-SQL.
For data scientists, Hive can run Python User-Defined Functions (UDFs) in Hive queries
to process records. This ability extends the capability of Hive queries in data analysis
considerably. Specifically, it allows data scientists to conduct scalable feature
engineering in languages they're mostly familiar with: the SQL-like HiveQL and Python.
For more information on Azure HDInsight Hive Clusters, see Use Hive and HiveQL with
Hadoop in HDInsight. To learn how to build a scalable end-to-end data science solution
with Azure HDInsight Hive Clusters, see The Team Data Science Process in action: using
HDInsight Hadoop clusters.
Especially useful for data science projects is the ability to create an Azure file store as
the place to share project data with your project team members. Each of them then has
access to the same copy of the data in the Azure file storage. They can also use this file
storage to share feature sets generated during the execution of the project. If the
project is a client engagement, your clients can create an Azure file storage under their
own Azure subscription to share the project data and features with you. In this way, the
client has full control of the project data assets. For more information on Azure File
Storage, see Get started with Azure File storage on Windows and How to use Azure File
Storage with Linux.
R Services (In-database) supports the open source R language with a comprehensive set
of SQL Server tools and technologies. They offer superior performance, security,
reliability, and manageability. You can deploy R solutions using convenient and familiar
tools. Your production applications can call the R runtime and retrieve predictions and
visuals using Transact-SQL. You also use the ScaleR libraries to improve the scale and
performance of your R solutions. For more information, see SQL Server R Services.
The TDSP team from Microsoft has published two end-to-end walkthroughs that show
how to build data science solutions in SQL Server 2016 R Services: one for R
programmers and one for SQL developers. For R Programmers, see Data Science End-
to-End Walkthrough. For SQL Developers, see In-Database Advanced Analytics for SQL
Developers (Tutorial).
PowerShell
PowerShell
4. Click <Your Name> at the top-right corner of the page and click security.
6. Paste the ssh key copied into the text box and save.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Understand data science for machine learning
Machine learning at scale
Introduction to Azure Machine Learning
Databricks Data Science & Engineering
Related resources
What is the Team Data Science Process?
Team Data Science Process roles and tasks
Compare the machine learning products and technologies from Microsoft
Identify scenarios and plan for
advanced analytics data processing
Article • 01/06/2023
What resources are required for you to create an environment that can perform
advanced analytics processing on a dataset? This article suggests a series of questions
to ask that can help identify tasks and resources relevant your scenario.
To learn about the order of high-level steps for predictive analytics, see What is the
Team Data Science Process (TDSP). Each step requires specific resources for the tasks
relevant to your particular scenario.
data logistics
data characteristics
dataset quality
preferred tools and languages
You may need to move the data several times during the analytics process. A common
scenario is to move local data into some form of storage on Azure and then into Azure
Machine Learning.
For more information, see Move data from a SQL Server database to SQL Azure with
Azure Data Factory.
Useful techniques for data inspection include descriptive statistics calculation and
visualization plots. For details of how to explore a dataset in various Azure
environments, see Explore data in the Team Data Science Process.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
What is the Team Data Science Process (TDSP)?
Related resources
Execute data science tasks: exploration, modeling, and deployment
Set up data science environments for use in the Team Data Science Process
Platforms and tools for data science projects
Data science and machine learning with Azure Databricks
Load data into storage environments for
analytics
Article • 11/15/2022
The Team Data Science Process requires that data be ingested or loaded into the most
appropriate way in each stage. Data destinations can include Azure Blob Storage, SQL
Azure databases, SQL Server on Azure VM, HDInsight (Hadoop), Azure Synapse
Analytics, and Azure Machine Learning.
The following articles describe how to ingest data into various target environments
where the data is stored and processed.
Technical and business needs, as well as the initial location, format, and size of your data
will determine the best data ingestion plan. It is not uncommon for a best plan to have
several steps. This sequence of tasks can include, for example, data exploration, pre-
processing, cleaning, down-sampling, and model training. Azure Data Factory is a
recommended Azure resource to orchestrate data movement and transformation.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
The Team Data Science Process requires that data be ingested or loaded into a variety of
different storage environments to be processed or analyzed in the most appropriate
way in each stage of the process. Azure Blob Storage has comprehensive documentation
at this link but this section in TDSP documentation provides a summary starter.
Which method is best for you depends on your scenario. The Scenarios for advanced
analytics in Azure Machine Learning article helps you determine the resources you need
for a variety of data science workflows used in the advanced analytics process.
7 Note
For a complete introduction to Azure blob storage, refer to Azure Blob Basics and
to Azure Blob Service.
Create and schedule a pipeline that downloads data from Azure Blob storage.
Pass it to a published Azure Machine Learning web service.
Receive the predictive analytics results.
Upload the results to storage.
For more information, see Create predictive pipelines using Azure Data Factory and
Azure Machine Learning.
Prerequisites
This article assumes that you have an Azure subscription, a storage account, and the
corresponding storage key for that account. Before uploading/downloading data, you
must know your Azure Storage account name and account key.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Introduction to Azure Blob Storage
Copy and move blobs from one container or storage account to another
What is the Team Data Science Process (TDSP)?
Related resources
Explore data in Azure Blob storage
Process Azure Blob Storage data with advanced analytics
Set up data science environments for use in the Team Data Science Process
Load data into storage environments for analytics
Move data to and from Azure Blob
Storage using Azure Storage Explorer
Article • 01/06/2023
Azure Storage Explorer is a free tool from Microsoft that allows you to work with Azure
Storage data on Windows, macOS, and Linux. This topic describes how to use it to
upload and download data from Azure Blob Storage. The tool can be downloaded from
Microsoft Azure Storage Explorer .
This menu links to technologies you can use to move data to and from Azure Blob
storage:
7 Note
If you are using VM that was set up with the scripts provided by Data Science
Virtual machines in Azure, then Azure Storage Explorer is already installed on the
VM.
7 Note
For a complete introduction to Azure Blob Storage, refer to Azure Blob Basics and
Azure Blob Service REST API.
Prerequisites
This document assumes that you have an Azure subscription, a storage account, and the
corresponding storage key for that account. Before uploading/downloading data, you
must know your Azure Storage account name and account key.
3. To bring up the Connect to Azure Storage wizard, select the Connect to Azure
Storage icon.
4. Enter the access key from your Azure Storage account on the Connect to Azure
Storage wizard and then Next.
5. Enter storage account name in the Account name box and then select Next.
6. The storage account added should now be displayed. To create a blob container in
a storage account, right-click the Blob Containers node in that account, select
Create Blob Container, and enter a name.
7. To upload data to a container, select the target container and click the Upload
button.
8. Click on the ... to the right of the Files box, select one or multiple files to upload
from the file system and click Upload to begin uploading the files.
9. To download data, selecting the blob in the corresponding container to download
and click Download.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Mark Tabladillo | Senior Cloud Solution Architect
Next steps
Introduction to Azure Blob Storage
Upload, download, and manage data with Azure Storage Explorer
What is the Team Data Science Process (TDSP)?
Related resources
Explore data in Azure Blob storage
Process Azure Blob Storage data with advanced analytics
Set up data science environments for use in the Team Data Science Process
Load data into storage environments for analytics
Move data to or from Azure Blob
Storage using SSIS connectors
Article • 01/06/2023
The Azure Feature Pack for Integration Services (SSIS) provides components to connect
to Azure, transfer data between Azure and on-premises data sources, and process data
stored in Azure.
This menu links to technologies you can use to move data to and from Azure Blob
storage:
Once customers have moved on-premises data into the cloud, they can access their data
from any Azure service to leverage the full power of the suite of Azure technologies. The
data may be subsequently used, for example, in Azure Machine Learning or on an
HDInsight cluster.
Examples for using these Azure resources are in the SQL and HDInsight walkthroughs.
For a discussion of canonical scenarios that use SSIS to accomplish business needs
common in hybrid data integration scenarios, see Doing more with SQL Server
Integration Services Feature Pack for Azure blog.
7 Note
For a complete introduction to Azure blob storage, refer to Azure Blob Basics and
to Azure Blob Service REST API.
Prerequisites
To perform the tasks described in this article, you must have an Azure subscription and
an Azure Storage account set up. You need the Azure Storage account name and
account key to upload or download data.
7 Note
SSIS is installed with SQL Server, but is not included in the Express version. For
information on what applications are included in various editions of SQL Server, see
SQL Server Technical Documentation
For information on how to get up-and-running using SISS to build simple extraction,
transformation, and load (ETL) packages, see SSIS Tutorial: Creating a Simple ETL
Package.
Field Description
Field Description
BlobContainer Specifies the name of the blob container that holds the
uploaded files as blobs.
BlobDirectory Specifies the blob directory where the uploaded file is stored as
a block blob. The blob directory is a virtual hierarchical
structure. If the blob already exists, it is replaced.
FileName Specifies a name filter to select files with the specified name
pattern. For example, MySheet*.xls* includes files such as
MySheet001.xls and MySheetABC.xlsx
7 Note
To run a Hive script on an Azure HDInsight cluster with SSIS, use Azure HDInsight
Hive Task.
To run a Pig script on an Azure HDInsight cluster with SSIS, use Azure HDInsight
Pig Task.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Introduction to Azure Blob Storage
Copy and move blobs from one container or storage account to another
Execute existing SSIS packages in Azure Data Factory or Azure Synapse Pipeline
What is the Team Data Science Process (TDSP)?
Related resources
Explore data in Azure Blob storage
Process Azure Blob Storage data with advanced analytics
Move data to and from Azure Blob Storage using Azure Storage Explorer
Load data into storage environments for analytics
Move data to SQL Server on an Azure
virtual machine
Article • 01/06/2023
This article outlines the options for moving data either from flat files (CSV or TSV
formats) or from an on-premises SQL Server to SQL Server on an Azure virtual machine.
These tasks for moving data to the cloud are part of the Team Data Science Process.
For a topic that outlines the options for moving data to an Azure SQL Database for
Machine Learning, see Move data to an Azure SQL Database for Azure Machine
Learning.
The following table summarizes the options for moving data to SQL Server on an Azure
virtual machine.
On-Premises SQL Server 1. Deploy a SQL Server Database to a Microsoft Azure VM wizard
2. Export to a flat File
3. SQL Database Migration Wizard
4. Database back up and restore
This document assumes that SQL commands are executed from SQL Server
Management Studio or Visual Studio Database Explorer.
Tip
As an alternative, you can use Azure Data Factory to create and schedule a
pipeline that will move data to a SQL Server VM on Azure. For more information,
see Copy data with Azure Data Factory (Copy Activity).
Prerequisites
This tutorial assumes you have:
An Azure subscription. If you do not have a subscription, you can sign up for a
free trial .
An Azure storage account. You will use an Azure storage account for storing the
data in this tutorial. If you don't have an Azure storage account, see the Create a
storage account article. After you have created the storage account, you will need
to obtain the account key used to access the storage. See Manage storage account
access keys.
Provisioned SQL Server on an Azure VM. For instructions, see Set up an Azure
virtual machine for SQL Server as an IPython Notebook server for advanced
analytics.
Installed and configured Azure PowerShell locally. For instructions, see How to
install and configure Azure PowerShell.
7 Note
Where should my data be for BCP? While it is not required, having files containing
source data located on the same machine as the target SQL Server allows for faster
transfers (network speed vs local disk IO speed). You can move the flat files
containing data to the machine where SQL Server is installed using various file
copying tools such as AZCopy, Azure Storage Explorer or windows copy/paste
via Remote Desktop Protocol (RDP).
1. Ensure that the database and the tables are created on the target SQL Server
database. Here is an example of how to do that using the Create Database and
Create Table commands:
SQL
2. Generate the format file that describes the schema for the table by issuing the
following command from the command line of the machine where bcp is installed.
3. Insert the data into the database using the bcp command, which should work from
the command line when SQL Server is installed on same machine:
Optimizing BCP Inserts Please refer the following article 'Guidelines for Optimizing
Bulk Import' to optimize such inserts.
7 Note
Big data Ingestion To optimize data loading for large and very large datasets,
partition your logical and physical database tables using multiple file groups and
partition tables. For more information about creating and loading data to partition
tables, see Parallel Load SQL Partition Tables.
The following sample PowerShell script demonstrates parallel inserts using bcp:
PowerShell
$NO_OF_PARALLEL_JOBS=2
#Trusted connection w.o username password (if you are using windows auth
and are signed in with that credentials)
#bcp database..tablename in datafile_path.csv -o
path_to_outputfile.$partitionnumber.txt -h "TABLOCK" -F 2 -f
format_file_path.xml -T -b block_size_to_move_in_single_attempt -t "," -r
\n
}
Here are some sample commands for Bulk Insert are as below:
1. Analyze your data and set any custom options before importing to make sure that
the SQL Server database assumes the same format for any special fields such as
dates. Here is an example of how to set the date format as year-month-day (if your
data contains the date in year-month-day format):
SQL
SQL
For details on SQL Server Data Tools, see Microsoft SQL Server Data Tools
For details on the Import/Export Wizard, see SQL Server Import and Export Wizard
1. Export the data from on-premises SQL Server to a file using the bcp utility as
follows
2. Create the database and the table on SQL Server VM on Azure using the create
database and create table for the table schema exported in step 1.
3. Create a format file for describing the table schema of the data being
exported/imported. Details of the format file are described in Create a Format File
(SQL Server).
Format file generation when running BCP from the SQL Server computer
servername\sqlinstance -T -t \t -r \n
Format file generation when running BCP remotely against a SQL Server
4. Use any of the methods described in section Moving Data from File Source to
move the data in flat files to a SQL Server.
1. Database back up and restore functionality (both to a local file or bacpac export to
blob) and Data Tier Applications (using bacpac).
2. Ability to directly create SQL Server VMs on Azure with a copied database or copy
to an existing database in SQL Database. For more information, see Use the Copy
Database Wizard.
A screenshot of the Database back up/restore options from SQL Server Management
Studio is shown below.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Migrate a Database to SQL Server on an Azure VM
SQL Server on Azure Virtual Machines overview
Related resources
Move data to an Azure SQL Database for Azure Machine Learning
Move data from a SQL Server database to SQL Database with Azure Data Factory
Process data in a SQL Server virtual machine on Azure
What is the Team Data Science Process?
Move data to Azure SQL Database for
Azure Machine Learning
Article • 11/15/2022
This article outlines the options for moving data either from flat files (CSV or TSV
formats) or from data stored in SQL Server to an Azure SQL Database. These tasks for
moving data to the cloud are part of the Team Data Science Process.
For a topic that outlines the options for migrating data from SQL Server into Azure SQL
options, see Migrate to Azure SQL.
The following table summarizes the options for moving data to an Azure SQL Database.
Prerequisites
The procedures outlined here require that you have:
An Azure subscription. If you do not have a subscription, you can sign up for a
free trial .
An Azure storage account. You use an Azure storage account for storing the data
in this tutorial. If you don't have an Azure storage account, see the Create a
storage account article. After you have created the storage account, you need to
obtain the account key used to access the storage. See Manage storage account
access keys.
Access to an Azure SQL Database. If you must set up an Azure SQL Database,
Getting Started with Microsoft Azure SQL Database provides information on how
to provision a new instance of an Azure SQL Database.
Installed and configured Azure PowerShell locally. For instructions, see How to
install and configure Azure PowerShell.
Data: The migration processes are demonstrated using the NYC Taxi dataset . The NYC
Taxi dataset contains information on trip data and fares, which is either available
through Azure Open Datasets or from the source TLC Trip Record Data . A sample and
description of these files are provided in NYC Taxi Trips Dataset Description.
You can either adapt the procedures described here to a set of your own data or follow
the steps as described by using the NYC Taxi dataset. To upload the NYC Taxi dataset
into your SQL Server database, follow the procedure outlined in Bulk Import Data into
SQL Server Database.
The steps for the first three are similar to those sections in Move data to SQL Server on
an Azure virtual machine that cover these same procedures. Links to the appropriate
sections in that topic are provided in the following instructions.
Consider using ADF when data needs to be continually migrated with hybrid on-
premises and cloud sources. ADF also helps when the data needs transformations, or
needs new business logic during migration. ADF allows for the scheduling and
monitoring of jobs using simple JSON scripts that manage the movement of data on a
periodic basis. ADF also has other capabilities such as support for complex operations.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
This article presents generic Hive queries that create Hive tables and load data from
Azure Blob Storage. Some guidance is also provided on partitioning Hive tables and on
using the Optimized Row Columnar (ORC) formatting to improve query performance.
Prerequisites
This article assumes that you have:
Created an Azure Storage account. If you need instructions, see About Azure
Storage accounts.
Provisioned a customized Hadoop cluster with the HDInsight service. If you need
instructions, see Setup Clusters in HDInsight.
Enabled remote access to the cluster, logged in, and opened the Hadoop
Command-Line console. If you need instructions, see Manage Apache Hadoop
clusters.
We assume that the data for Hive tables is in an uncompressed tabular format, and that
the data has been uploaded to the default (or to an additional) container of the storage
account used by the Hadoop cluster.
If you want to practice on the NYC Taxi Trip Data, you need to:
download the 24 NYC Taxi Trip Data files (12 Trip files and 12 Fare files) -- either
available through Azure Open Datasets or from the source TLC Trip Record Data ,
unzip all files into .csv files, and then
upload them to the default (or appropriate container) of the Azure Storage
account; options for such an account appear at Use Azure Storage with Azure
HDInsight clusters topic. The process to upload the .csv files to the default
container on the storage account can be found on this page.
Hive queries are SQL-like. If you are familiar with SQL, you may find the Hive for SQL
Users Cheat Sheet useful.
When submitting a Hive query, you can also control the destination of the output from
Hive queries, whether it be on the screen or to a local file on the head node or to an
Azure blob.
Log in to the head node of the Hadoop cluster, open the Hadoop Command Line on the
desktop of the head node, and enter command cd %hive_home%\bin .
You have three ways to submit Hive queries in the Hadoop Command Line:
directly
using .hql files
with the Hive command console
Console
By default, after Hive query is submitted in Hadoop Command Line, the progress of the
Map/Reduce job is printed out on screen. To suppress the screen print of the
Map/Reduce job progress, you can use an argument -S ("S" in upper case) in the
command line as follows:
Console
You can also first enter the Hive command console by running command hive in
Hadoop Command Line, and then submit Hive queries in Hive command console. Here
is an example. In this example, the two red boxes highlight the commands used to enter
the Hive command console, and the Hive query submitted in Hive command console,
respectively. The green box highlights the output from the Hive query.
The previous examples directly output the Hive query results on screen. You can also
write the output to a local file on the head node, or to an Azure blob. Then, you can use
other tools to further analyze the output of Hive queries.
Output Hive query results to a local file. To output Hive query results to a local
directory on the head node, you have to submit the Hive query in the Hadoop
Command Line as follows:
Console
You can also output the Hive query results to an Azure blob, within the default container
of the Hadoop cluster. The Hive query for this is as follows:
Console
In the following example, the output of Hive query is written to a blob directory
queryoutputdir within the default container of the Hadoop cluster. Here, you only need
to provide the directory name, without the blob name. An error is thrown if you provide
both directory and blob names, such as wasb:///queryoutputdir/queryoutput.txt .
If you open the default container of the Hadoop cluster using Azure Storage Explorer,
you can see the output of the Hive query as shown in the following figure. You can apply
the filter (highlighted by red box) to only retrieve the blob with specified letters in
names.
Submit Hive queries with the Hive Editor
You can also use the Query Console (Hive Editor) by entering a URL of the form
https://<Hadoop cluster name>.azurehdinsight.net/Home/HiveEditor into a web browser.
You must be logged in the see this console and so you need your Hadoop cluster
credentials here.
HiveQL
create database if not exists <database name>;
CREATE EXTERNAL TABLE if not exists <database name>.<table name>
(
field1 string,
field2 int,
field3 float,
field4 double,
...,
fieldN string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>' lines
terminated by '<line separator>'
STORED AS TEXTFILE LOCATION '<storage location>'
TBLPROPERTIES("skip.header.line.count"="1");
Here are the descriptions of the fields that you need to plug in and other configurations:
<database name>: the name of the database that you want to create. If you just
want to use the default database, the query "create database..." can be omitted.
<table name>: the name of the table that you want to create within the specified
database. If you want to use the default database, the table can be directly referred
by <table name> without <database name>.
<field separator>: the separator that delimits fields in the data file to be uploaded
to the Hive table.
<line separator>: the separator that delimits lines in the data file.
<storage location>: the Azure Storage location to save the data of Hive tables. If
you do not specify LOCATION <storage location>, the database and the tables are
stored in hive/warehouse/ directory in the default container of the Hive cluster by
default. If you want to specify the storage location, the storage location has to be
within the default container for the database and tables. This location has to be
referred as location relative to the default container of the cluster in the format of
'wasb:///<directory 1>/' or 'wasb:///<directory 1>/<directory 2>/', etc. After the
query is executed, the relative directories are created within the default container.
TBLPROPERTIES("skip.header.line.count"="1"): If the data file has a header line,
you have to add this property at the end of the create table query. Otherwise, the
header line is loaded as a record to the table. If the data file does not have a
header line, this configuration can be omitted in the query.
HiveQL
LOAD DATA INPATH '<path to blob data>' INTO TABLE <database name>.<table
name>;
<path to blob data>: If the blob file to be uploaded to the Hive table is in the
default container of the HDInsight Hadoop cluster, the <path to blob data> should
be in the format 'wasb://<directory in this container>/<blob file name>'. The blob
file can also be in an additional container of the HDInsight Hadoop cluster. In this
case, <path to blob data> should be in the format 'wasb://<container
name>@<storage account name>.blob.core.windows.net/<blob file name>'.
7 Note
In addition to partitioning Hive tables, it is also beneficial to store the Hive data in the
Optimized Row Columnar (ORC) format. For more information on ORC formatting, see
Using ORC files improves performance when Hive is reading, writing, and processing
data .
Partitioned table
Here is the Hive query that creates a partitioned table and loads data into it.
HiveQL
HiveQL
select
field1, field2, ..., fieldN
from <database name>.<partitioned table name>
where <partitionfieldname>=<partitionfieldvalue> and ...;
Create an external table STORED AS TEXTFILE and load data from blob storage to the
table.
HiveQL
LOAD DATA INPATH '<path to the source file>' INTO TABLE <database name>.
<table name>;
Create an internal table with the same schema as the external table in step 1, with the
same field delimiter, and store the Hive data in the ORC format.
HiveQL
CREATE TABLE IF NOT EXISTS <database name>.<ORC table name>
(
field1 string,
field2 int,
...
fieldN date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>' STORED AS ORC;
Select data from the external table in step 1 and insert into the ORC table
HiveQL
7 Note
Inserting it into the <database name>.<ORC table name> fails since <database
name>.<ORC table name> does not have the partition variable as a field in the
table schema. In this case, you need to specifically select the fields to be inserted to
<database name>.<ORC table name> as follows:
HiveQL
It is safe to drop the <external text file table name> when using the following query
after all data has been inserted into <database name>.<ORC table name>:
HiveQL
After following this procedure, you should have a table with data in the ORC format
ready to use.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
This article describes how to build partitioned tables for fast parallel bulk importing of
data to a SQL Server database. For big data loading/transfer to a SQL database,
importing data to the SQL database and subsequent queries can be improved by using
Partitioned Tables and Views.
Add database filegroups to the database, which holds the partitioned physical files.
This can be done with CREATE DATABASE if new or ALTER DATABASE if the
database exists already.
7 Note
Specify the target filegroup, which holds data for this partition and the
physical database file name(s) where the filegroup data is stored.
The following example creates a new database with three filegroups other than the
primary and log groups, containing one physical file in each. The database files are
created in the default SQL Server Data folder, as configured in the SQL Server instance.
For more information about the default file locations, see File Locations for Default and
Named Instances of SQL Server.
SQL
EXECUTE ('
CREATE DATABASE <database_name>
ON PRIMARY
( NAME = ''Primary'', FILENAME = ''' + @data_path +
'<primary_file_name>.mdf'', SIZE = 4096KB , FILEGROWTH = 1024KB ),
FILEGROUP [filegroup_1]
( NAME = ''FileGroup1'', FILENAME = ''' + @data_path +
'<file_name_1>.ndf'' , SIZE = 4096KB , FILEGROWTH = 1024KB ),
FILEGROUP [filegroup_2]
( NAME = ''FileGroup2'', FILENAME = ''' + @data_path +
'<file_name_2>.ndf'' , SIZE = 4096KB , FILEGROWTH = 1024KB ),
FILEGROUP [filegroup_3]
( NAME = ''FileGroup3'', FILENAME = ''' + @data_path +
'<file_name_3>.ndf'' , SIZE = 102400KB , FILEGROWTH = 10240KB )
LOG ON
( NAME = ''LogFileGroup'', FILENAME = ''' + @data_path +
'<log_file_name>.ldf'' , SIZE = 1024KB , FILEGROWTH = 10%)
')
SQL
SQL
CREATE PARTITION SCHEME <DatetimeFieldPScheme> AS
PARTITION <DatetimeFieldPFN> TO (
<filegroup_1>, <filegroup_2>, <filegroup_3>, <filegroup_4>,
<filegroup_5>, <filegroup_6>, <filegroup_7>, <filegroup_8>,
<filegroup_9>, <filegroup_10>, <filegroup_11>, <filegroup_12> )
To verify the ranges in effect in each partition according to the function/scheme, run the
following query:
SQL
SQL
SQL
ALTER DATABASE <database_name> SET RECOVERY BULK_LOGGED
To expedite data loading, launch the bulk import operations in parallel. For tips on
expediting bulk importing of big data into SQL Server databases, see Load 1 TB in
less than 1 hour.
The following PowerShell script is an example of parallel data loading using BCP.
PowerShell
# Set database name, input data directory, and output log directory
# This example loads comma-separated input data files
# The example assumes the partitioned data files are named as
<base_file_name>_<partition_number>.csv
# Assumes the input data files include a header line. Loading starts at line
number 2.
$dbname = "<database_name>"
$indir = "<path_to_data_files>"
$logdir = "<path_to_log_directory>"
# Set number of partitions per table - Should match the number of input data
files per table
$numofparts = <number_of_partitions>
# Set table name to be loaded, basename of input data files, input format
file, and number of partitions
$tbname = "<table_name>"
$basename = "<base_input_data_filename_no_extension>"
$fmtfile = "<full_path_to_format_file>"
Get-Job
# Optional - Wait till all jobs complete and report date and time
date
While (Get-Job -State "Running") { Start-Sleep 10 }
date
Create indexes (clustered or non-clustered) targeting the same filegroup for each
partition, for example:
SQL
-- or,
7 Note
You may choose to create the indexes before bulk importing the data. Index
creation before bulk importing slows down the data loading.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
This article shows how to move data from a SQL Server database to Azure SQL Database
via Azure Blob Storage using the Azure Data Factory (ADF): this method is a supported
legacy approach that has the advantages of a replicated staging copy, though we
suggest to look at our data migration page for the latest options .
For a table that summarizes various options for moving data to an Azure SQL Database,
see Move data to an Azure SQL Database for Azure Machine Learning.
With ADF, existing data processing services can be composed into data pipelines that
are highly available and managed in the cloud. These data pipelines can be scheduled to
ingest, prepare, transform, analyze, and publish data, and ADF manages and
orchestrates the complex data and processing dependencies. Solutions can be quickly
built and deployed in the cloud, connecting a growing number of on-premises and
cloud data sources.
ADF allows for the scheduling and monitoring of jobs using simple JSON scripts that
manage the movement of data on a periodic basis. ADF also has other capabilities such
as support for complex operations. For more information on ADF, see the
documentation at Azure Data Factory (ADF) .
The Scenario
We set up an ADF pipeline that composes two data migration activities. Together they
move data on a daily basis between a SQL Server database and Azure SQL Database.
The two activities are:
Copy data from a SQL Server database to an Azure Blob Storage account
Copy data from the Azure Blob Storage account to Azure SQL Database.
7 Note
The steps shown here have been adapted from the more detailed tutorial provided
by the ADF team: Copy data from a SQL Server database to Azure Blob storage
References to the relevant sections of that topic are provided when appropriate.
Prerequisites
This tutorial assumes you have:
An Azure subscription. If you do not have a subscription, you can sign up for a
free trial .
An Azure storage account. You use an Azure storage account for storing the data
in this tutorial. If you don't have an Azure storage account, see the Create a
storage account article. After you have created the storage account, you need to
obtain the account key used to access the storage. See Manage storage account
access keys.
Access to an Azure SQL Database. If you must set up an Azure SQL Database, the
topic Getting Started with Microsoft Azure SQL Database provides information on
how to provision a new instance of an Azure SQL Database.
Installed and configured Azure PowerShell locally. For instructions, see How to
install and configure Azure PowerShell.
7 Note
You can either adapt the procedure provided here to a set of your own data or follow
the steps as described by using the NYC Taxi dataset. To upload the NYC Taxi dataset
into your SQL Server database, follow the procedure outlined in Bulk Import Data into
SQL Server database.
7 Note
You should execute the Add-AzureAccount cmdlet before executing the New-
AzureDataFactoryTable cmdlet to confirm that the right Azure subscription is
selected for the command execution. For documentation of this cmdlet, see Add-
AzureAccount.
7 Note
These procedures use Azure PowerShell to define and create the ADF activities. But
these tasks can also be accomplished using the Azure portal. For details, see Create
datasets.
JSON
{
"name": "OnPremSQLTable",
"properties":
{
"location":
{
"type": "OnPremisesSqlServerTableLocation",
"tableName": "nyctaxi_data",
"linkedServiceName": "adfonpremsql"
},
"availability":
{
"frequency": "Day",
"interval": 1,
"waitOnExternal":
{
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
The column names were not included here. You can subselect on the column names by
including them here (for details check the ADF documentation topic.
Copy the JSON definition of the table into a file called onpremtabledef.json file and save
it to a known location (here assumed to be C:\temp\onpremtabledef.json). Create the
table in ADF with the following Azure PowerShell cmdlet:
Azure PowerShell
Blob Table
Definition for the table for the output blob location is in the following (this maps the
ingested data from on-premises to Azure blob):
JSON
{
"name": "OutputBlobTable",
"properties":
{
"location":
{
"type": "AzureBlobLocation",
"folderPath": "containername",
"format":
{
"type": "TextFormat",
"columnDelimiter": "\t"
},
"linkedServiceName": "adfds"
},
"availability":
{
"frequency": "Day",
"interval": 1
}
}
}
Copy the JSON definition of the table into a file called bloboutputtabledef.json file and
save it to a known location (here assumed to be C:\temp\bloboutputtabledef.json). Create
the table in ADF with the following Azure PowerShell cmdlet:
Azure PowerShell
JSON
{
"name": "OutputSQLAzureTable",
"properties":
{
"structure":
[
{ "name": "column1", "type": "String"},
{ "name": "column2", "type": "String"}
],
"location":
{
"type": "AzureSqlTableLocation",
"tableName": "your_db_name",
"linkedServiceName": "adfdssqlazure_linked_servicename"
},
"availability":
{
"frequency": "Day",
"interval": 1
}
}
}
Copy the JSON definition of the table into a file called AzureSqlTable.json file and save it
to a known location (here assumed to be C:\temp\AzureSqlTable.json). Create the table
in ADF with the following Azure PowerShell cmdlet:
Azure PowerShell
7 Note
The following procedures use Azure PowerShell to define and create the ADF
pipeline. But this task can also be accomplished using the Azure portal. For details,
see Create pipeline.
Using the table definitions provided previously, the pipeline definition for the ADF is
specified as follows:
JSON
{
"name": "AMLDSProcessPipeline",
"properties":
{
"description" : "This pipeline has two activities: the first one
copies data from SQL Server to Azure Blob, and the second one copies from
Azure Blob to Azure Database Table",
"activities":
[
{
"name": "CopyFromSQLtoBlob",
"description": "Copy data from SQL Server to blob",
"type": "CopyActivity",
"inputs": [ {"name": "OnPremSQLTable"} ],
"outputs": [ {"name": "OutputBlobTable"} ],
"transformation":
{
"source":
{
"type": "SqlSource",
"sqlReaderQuery": "select * from nyctaxi_data"
},
"sink":
{
"type": "BlobSink"
}
},
"Policy":
{
"concurrency": 3,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 0,
"timeout": "01:00:00"
}
},
{
"name": "CopyFromBlobtoSQLAzure",
"description": "Push data to Sql Azure",
"type": "CopyActivity",
"inputs": [ {"name": "OutputBlobTable"} ],
"outputs": [ {"name": "OutputSQLAzureTable"} ],
"transformation":
{
"source":
{
"type": "BlobSource"
},
"sink":
{
"type": "SqlSink",
"WriteBatchTimeout": "00:5:00",
}
},
"Policy":
{
"concurrency": 3,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 2,
"timeout": "02:00:00"
}
}
]
}
}
Copy this JSON definition of the pipeline into a file called pipelinedef.json file and save it
to a known location (here assumed to be C:\temp\pipelinedef.json). Create the pipeline in
ADF with the following Azure PowerShell cmdlet:
Azure PowerShell
Azure PowerShell
The startdate and enddate parameter values need to be replaced with the actual dates
between which you want the pipeline to run.
Once the pipeline executes, you should be able to see the data show up in the container
selected for the blob, one file per day.
We have not leveraged the functionality provided by ADF to pipe data incrementally. For
more information on how to do this and other capabilities provided by ADF, see the
ADF documentation .
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Pre-processing and cleaning data are important tasks that must be conducted before a
dataset can be used for model training. Raw data is often noisy and unreliable, and may
be missing values. Using such data for modeling can produce misleading results. These
tasks are part of the Team Data Science Process (TDSP) and typically follow an initial
exploration of a dataset used to discover and plan the pre-processing required. For
more detailed instructions on the TDSP process, see the steps outlined in the Team Data
Science Process.
Pre-processing and cleaning tasks, like the data exploration task, can be carried out in a
wide variety of environments, such as SQL or Hive or Azure Machine Learning Studio
(classic), and with various tools and languages, such as R or Python, depending where
your data is stored and how it is formatted. Since TDSP is iterative in nature, these tasks
can take place at various steps in the workflow of the process.
This article introduces various data processing concepts and tasks that can be
undertaken either before or after ingesting data into Azure Machine Learning Studio
(classic).
For an example of data exploration and pre-processing done inside Azure Machine
Learning Studio (classic), see the Pre-processing data video.
Quality data is a prerequisite for quality predictive models. To avoid "garbage in,
garbage out" and improve data quality and therefore model performance, it is
imperative to conduct a data health screen to spot data issues early and decide on the
corresponding data processing and cleaning steps.
What are some typical data health screens that
are employed?
We can check the general quality of data by checking:
When you find issues with data, processing steps are necessary, which often involves
cleaning missing values, data normalization, discretization, text processing to remove
and/or replace embedded characters that may affect data alignment, mixed data types
in common fields, and others.
Azure Machine Learning consumes well-formed tabular data. If the data is already in
tabular form, data pre-processing can be performed directly with Azure Machine
Learning Studio (classic) in the Machine Learning. If data is not in tabular form, say it is
in XML, parsing may be required in order to convert the data to tabular form.
Equal-Width Binning: Divide the range of all possible values of an attribute into N
groups of the same size, and assign the values that fall in a bin with the bin
number.
Equal-Height Binning: Divide the range of all possible values of an attribute into N
groups, each containing the same number of instances, then assign the values that
fall in a bin with the bin number.
How to reduce data?
There are various methods to reduce data size for easier data handling. Depending on
data size and the domain, the following methods can be applied:
Record Sampling: Sample the data records and only choose the representative
subset from the data.
Attribute Sampling: Select only a subset of the most important attributes from the
data.
Aggregation: Divide the data into groups and store the numbers for each group.
For example, the daily revenue numbers of a restaurant chain over the past 20
years can be aggregated to monthly revenue to reduce the size of the data.
Data exploration offers an early view into the data. A number of data issues can be
uncovered during this step and corresponding methods can be applied to address those
issues. It is important to ask questions such as what is the source of the issue and how
the issue may have been introduced. This process also helps you decide on the data
processing steps that need to be taken to resolve them. Identifying the final use cases
and personas can also be used to prioritize the data processing effort.
References
Data Mining: Concepts and Techniques, Third Edition, Morgan Kaufmann, 2011,
Jiawei Han, Micheline Kamber, and Jian Pei
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Preprocess large datasets with Azure Machine Learning
Azure Machine Learning Studio
What is Azure Machine Learning?
Related resources
Explore data in the Team Data Science Process
Sample data in Azure blob containers, SQL Server, and Hive tables
Process Azure Blob Storage data with advanced analytics
What is the Team Data Science Process?
Explore data in the Team Data Science
Process
Article • 11/15/2022
The following articles describe how to explore data in three different storage
environments that are typically used in the Data Science Process:
Explore Azure blob container data using the Pandas Python package.
Explore SQL Server data by using SQL and by using a programming language like
Python.
Explore Hive table data using Hive queries.
The Azure Machine Learning Resources provide documentation and videos on getting
started with Azure Machine Learning.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
This article covers how to explore data that is stored in Azure blob container using the
pandas Python package.
Prerequisites
This article assumes that you have:
Created an Azure storage account. If you need instructions, see Create an Azure
Storage account
Stored your data in an Azure Blob storage account. If you need instructions, see
Moving data to and from Azure Storage
1. Download the data from Azure blob with the following Python code sample using
Blob service. Replace the variable in the following code with your specific values:
Python
STORAGEACCOUNTURL= <storage_account_url>
STORAGEACCOUNTKEY= <storage_account_key>
LOCALFILENAME= <local_file_name>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>
2. Read the data into a pandas DataFrame from the downloaded file.
Python
If you need more general information on reading from an Azure Storage Blob, look at
our documentation Azure Storage Blobs client library for Python.
Now you are ready to explore the data and generate features on this dataset.
Python
Python
dataframe_blobdata.head(10)
dataframe_blobdata.tail(10)
3. Check the data type each column was imported as using the following sample
code
Python
4. Check the basic stats for the columns in the data set as follows
Python
dataframe_blobdata.describe()
Python
dataframe_blobdata['<column_name>'].value_counts()
6. Count missing values versus the actual number of entries in each column using
the following sample code
Python
7. If you have missing values for a specific column in the data, you can drop them as
follows:
Python
dataframe_blobdata_noNA = dataframe_blobdata.dropna()
dataframe_blobdata_noNA.shape
Python
dataframe_blobdata_mode = dataframe_blobdata.fillna(
{'<column_name>': dataframe_blobdata['<column_name>'].mode()[0]})
8. Create a histogram plot using variable number of bins to plot the distribution of a
variable
Python
dataframe_blobdata['<column_name>'].value_counts().plot(kind='bar')
np.log(dataframe_blobdata['<column_name>']+1).hist(bins=50)
Python
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
This article covers how to explore data that is stored in a SQL Server VM on Azure. Use
SQL or Python to examine the data.
7 Note
The sample SQL statements in this document assume that data is in SQL Server. If it
isn't, refer to the cloud data science process map to learn how to move your data
to SQL Server.
7 Note
For a practical example, you can use the NYC Taxi dataset and refer to the IPNB
titled NYC Data wrangling using IPython Notebook and SQL Server for an end-
to-end walk-through.
The following connection string format can be used to connect to a SQL Server
database from Python using pyodbc (replace servername, dbname, username, and
password with your specific values):
Python
The Pandas library in Python provides a rich set of data structures and data analysis
tools for data manipulation for Python programming. The following code reads the
results returned from a SQL Server database into a Pandas data frame:
Python
# Query database and load the returned results in pandas data frame
data_frame = pd.read_sql('''select <columnname1>, <columnname2>... from
<tablename>''', conn)
Now you can work with the Pandas DataFrame as covered in the topic Process Azure
Blob data in your data science environment.
Principal author:
This article provides sample Hive scripts that are used to explore data in Hive tables in
an HDInsight Hadoop cluster.
Prerequisites
This article assumes that you have:
Created an Azure storage account. If you need instructions, see Create an Azure
Storage account
Provisioned a customized Hadoop cluster with the HDInsight service. If you need
instructions, see Customize Azure HDInsight Hadoop Clusters for Advanced
Analytics.
The data has been uploaded to Hive tables in Azure HDInsight Hadoop clusters. If
it has not, follow the instructions in Create and load data to Hive tables to upload
data to Hive tables first.
Enabled remote access to the cluster. If you need instructions, see Access the Head
Node of Hadoop Cluster.
If you need instructions on how to submit Hive queries, see How to Submit Hive
Queries
<column_name>
HiveQL
SELECT
a.<common_columnname1> as <new_name1>,
a.<common_columnname2> as <new_name2>,
a.<a_column_name1> as <new_name3>,
a.<a_column_name2> as <new_name4>,
b.<b_column_name1> as <new_name5>,
b.<b_column_name2> as <new_name6>
FROM
(
SELECT <common_columnname1>,
<common_columnname2>,
<a_column_name1>,
<a_column_name2>,
FROM <databasename>.<tablename1>
) a
join
(
SELECT <common_columnname1>,
<common_columnname2>,
<b_column_name1>,
<b_column_name2>,
FROM <databasename>.<tablename2>
) b
ON a.<common_columnname1>=b.<common_columnname1> and a.
<common_columnname2>=b.<common_columnname2>
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
The following articles describe how to sample data that is stored in one of three
different Azure locations:
This sampling task is a step in the Team Data Science Process (TDSP).
If the dataset you plan to analyze is large, it's usually a good idea to down-sample the
data to reduce it to a smaller but representative and more manageable size. Downsizing
may facilitate data understanding, exploration, and feature engineering. This sampling
role in the Cortana Analytics Process is to enable fast prototyping of the data processing
functions and machine learning models.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
This article covers sampling data stored in Azure Blob storage by downloading it
programmatically and then sampling it using procedures written in Python.
Why sample your data? If the dataset you plan to analyze is large, it's usually a good
idea to down-sample the data to reduce it to a smaller but representative and more
manageable size. Sampling facilitates data understanding, exploration, and feature
engineering. Its role in the Cortana Analytics Process is to enable fast prototyping of the
data processing functions and machine learning models.
This sampling task is a step in the Team Data Science Process (TDSP).
Python
STORAGEACCOUNTNAME= <storage_account_name>
STORAGEACCOUNTKEY= <storage_account_key>
LOCALFILENAME= <local_file_name>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>
2. Read data into a Pandas data-frame from the file downloaded above.
Python
import pandas as pd
Python
# A 1 percent sample
sample_ratio = 0.01
sample_size = np.round(dataframe_blobdata.shape[0] * sample_ratio)
sample_rows = np.random.choice(dataframe_blobdata.index.values,
sample_size)
dataframe_blobdata_sample = dataframe_blobdata.ix[sample_rows]
Now you can work with the above data frame with the one Percent sample for further
exploration and feature generation.
Python
dataframe.to_csv(os.path.join(os.getcwd(),LOCALFILENAME), sep='\t',
encoding='utf-8', index=False)
2. Upload the local file to an Azure blob using the following sample code:
Python
STORAGEACCOUNTNAME= <storage_account_name>
LOCALFILENAME= <local_file_name>
STORAGEACCOUNTKEY= <storage_account_key>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>
output_blob_service=BlobService(account_name=STORAGEACCOUNTNAME,account
_key=STORAGEACCOUNTKEY)
localfileprocessed = os.path.join(os.getcwd(),LOCALFILENAME) #assuming
file is in current working directory
try:
#perform upload
output_blob_service.put_block_blob_from_path(CONTAINERNAME,BLOBNAME,loc
alfileprocessed)
except:
print ("Something went wrong with uploading to the blob:"+
BLOBNAME)
3. Make a datastore in Azure Machine Learning which points to the Azure Blob
Storage. This link describes the concept of datastores and how to subsequently
make a dataset for use with Azure Machine Learning.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
This article shows how to sample data stored in SQL Server on Azure using either SQL or
the Python programming language. It also shows how to move sampled data into Azure
Machine Learning by saving it to a file, uploading it to an Azure blob, and then reading
it into Azure Machine Learning Studio.
The Python sampling uses the pyodbc ODBC library to connect to SQL Server on
Azure and the Pandas library to do the sampling.
7 Note
The sample SQL code in this document assumes that the data is in a SQL Server on
Azure. If it is not, refer to Move data to SQL Server on Azure article for instructions
on how to move your data to SQL Server on Azure.
Why sample your data? If the dataset you plan to analyze is large, it's usually a good
idea to down-sample the data to reduce it to a smaller but representative and more
manageable size. Sampling facilitates data understanding, exploration, and feature
engineering. Its role in the Team Data Science Process (TDSP) is to enable fast
prototyping of the data processing functions and machine learning models.
This sampling task is a step in the Team Data Science Process (TDSP).
Using SQL
This section describes several methods using SQL to perform simple random sampling
against the data in the database. Choose a method based on your data size and its
distribution.
The following two items show how to use newid in SQL Server to perform the sampling.
The method you choose depends on how random you want the sample to be (pk_id in
the following sample code is assumed to be an autogenerated primary key).
SQL
SQL
Tablesample can be used for sampling the data as well. This option may be a better
approach if your data size is large (assuming that data on different pages is not
correlated) and for the query to complete in a reasonable time.
SQL
SELECT *
FROM <table_name>
TABLESAMPLE (10 PERCENT)
7 Note
You can explore and generate features from this sampled data by storing it in a
new table
Python
Python
import pandas as pd
# Query database and load the returned results in pandas data frame
data_frame = pd.read_sql('''select column1, column2... from <table_name>
tablesample (0.1 percent)''', conn)
You can now work with the sampled data in the Pandas data frame.
Python
dataframe.to_csv(os.path.join(os.getcwd(),LOCALFILENAME), sep='\t',
encoding='utf-8', index=False)
Python
STORAGEACCOUNTNAME= <storage_account_name>
LOCALFILENAME= <local_file_name>
STORAGEACCOUNTKEY= <storage_account_key>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>
output_blob_service=BlobService(account_name=STORAGEACCOUNTNAME,account
_key=STORAGEACCOUNTKEY)
localfileprocessed = os.path.join(os.getcwd(),LOCALFILENAME) #assuming
file is in current working directory
try:
#perform upload
output_blob_service.put_block_blob_from_path(CONTAINERNAME,BLOBNAME,loc
alfileprocessed)
except:
print ("Something went wrong with uploading blob:"+BLOBNAME)
3. This guide provides an overview of the next step to access data in Azure Machine
Learning through datastores and datasets.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
This article describes how to down-sample data stored in Azure HDInsight Hive tables
using Hive queries to reduce it to a size more manageable for analysis. It covers three
popularly used sampling methods:
Why sample your data? If the dataset you plan to analyze is large, it's usually a good
idea to down-sample the data to reduce it to a smaller but representative and more
manageable size. Down-sampling facilitates data understanding, exploration, and
feature engineering. Its role in the Team Data Science Process is to enable fast
prototyping of the data processing functions and machine learning models.
This sampling task is a step in the Team Data Science Process (TDSP).
Python
Here, <sample rate, 0-1> specifies the proportion of records that the users want to
sample.
Python
HiveQL
For information on more advanced sampling methods that are available in Hive, see
LanguageManual Sampling .
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
The preview of Microsoft Azure Machine Learning Python client library can enable
secure access to your Azure Machine Learning datasets from a local Python environment
and enables the creation and management of datasets in a workspace.
Prerequisites
The Python client library has been tested under the following environments:
requests
python-dateutil
pandas
Console
Alternatively, you can download and install from the sources on GitHub .
Console
If you have git installed on your machine, you can use pip to install directly from the git
repository:
Console
From the Azure Machine Learning Studio (classic) web interface, you can generate code
snippets that include all the necessary information to download and deserialize datasets
as pandas DataFrame objects on your local machine.
For security reasons, the code snippet functionality is only available to users that have
their role set as Owner for the workspace. Your role is displayed in Azure Machine
Learning Studio (classic) on the USERS page under Settings.
If your role is not set as Owner, you can either request to be reinvited as an owner, or
ask the owner of the workspace to provide you with the code snippet.
To obtain the authorization token, you may choose one of these options:
Ask for a token from an owner. Owners can access their authorization tokens from
the Settings page of their workspace in Azure Machine Learning Studio (classic).
Select Settings from the left pane and click AUTHORIZATION TOKENS to see the
primary and secondary tokens. Although either the primary or the secondary
authorization tokens can be used in the code snippet, it is recommended that
owners only share the secondary authorization tokens.
Once developers have obtained the workspace ID and authorization token, they are able
to access the workspace using the code snippet regardless of their role.
2. Select the dataset you would like to access. You can select any of the datasets from
the MY DATASETS list or from the SAMPLES list.
3. From the bottom toolbar, click Generate Data Access Code. If the data is in a
format incompatible with the Python client library, this button is disabled.
4. Select the code snippet from the window that appears and copy it to your
clipboard.
5. Paste the code into the notebook of your local Python application.
Intermediate datasets can be accessed as long as the data format is compatible with the
Python client library.
The following formats are supported (constants for these formats are in the
azureml.DataTypeIds class):
PlainText
GenericCSV
GenericTSV
GenericCSVNoHeader
GenericTSVNoHeader
You can determine the format by hovering over a module output node. It is displayed
along with the node name, in a tooltip.
Some of the modules, such as the Split module, output to a format named Dataset ,
which is not supported by the Python client library.
You need to use a conversion module, such as Convert to CSV, to get an output into a
supported format.
The following steps show an example that creates an experiment, runs it and accesses
the intermediate dataset.
3. Insert a Split module, and connect its input to the dataset module output.
4. Insert a Convert to CSV module and connect its input to one of the Split module
outputs.
5. Save the experiment, run it, and wait for the job to finish.
7. When the context menu appears, select Generate Data Access Code.
8. Select the code snippet and copy it to your clipboard from the window that
appears.
9. Paste the code in your notebook.
10. You can visualize the data using matplotlib. This displays in a histogram for the age
column:
Use the Machine Learning Python client library
to access, read, create, and manage datasets
Workspace
The workspace is the entry point for the Python client library. Provide the Workspace
class with your workspace ID and authorization token to create an instance:
Python
ws = Workspace(workspace_id='4c29e1adeba2e5a7cbeb0e4f4adfb4df',
authorization_token='f4f3ade2c6aefdb1afb043cd8bcf3daf')
Enumerate datasets
To enumerate all datasets in a given workspace:
Python
for ds in ws.datasets:
print(ds.name)
Python
for ds in ws.user_datasets:
print(ds.name)
To enumerate just the example datasets:
Python
for ds in ws.example_datasets:
print(ds.name)
Python
Python
ds = ws.datasets[0]
Metadata
Datasets have metadata, in addition to content. (Intermediate datasets are an exception
to this rule and do not have any metadata.)
print(ds.name)
print(ds.description)
print(ds.family_id)
print(ds.data_type_id)
print(ds.id)
print(ds.created_date)
print(ds.size)
Read contents
The code snippets provided by Machine Learning Studio (classic) automatically
download and deserialize the dataset to a pandas DataFrame object. This is done with
the to_dataframe method:
Python
frame = ds.to_dataframe()
If you prefer to download the raw data, and perform the deserialization yourself, that is
an option. At the moment, this is the only option for formats such as 'ARFF', which the
Python client library cannot deserialize.
Python
text_data = ds.read_as_text()
Python
binary_data = ds.read_as_binary()
Python
If you have your data in a pandas DataFrame, use the following code:
Python
dataset = ws.datasets.add_from_dataframe(
dataframe=frame,
data_type_id=DataTypeIds.GenericCSV,
name='my new dataset',
description='my description'
)
Python
dataset = ws.datasets.add_from_raw_data(
raw_data=raw_data,
data_type_id=DataTypeIds.GenericCSV,
name='my new dataset',
description='my description'
)
The Python client library is able to serialize a pandas DataFrame to the following formats
(constants for these are in the azureml.DataTypeIds class):
PlainText
GenericCSV
GenericTSV
GenericCSVNoHeader
GenericTSVNoHeader
To update an existing dataset, you first need to get a reference to the existing dataset:
Python
print(dataset.data_type_id) # 'GenericCSV'
print(dataset.name) # 'existing dataset'
print(dataset.description) # 'data up to jan 2015'
Then use update_from_dataframe to serialize and replace the contents of the dataset on
Azure:
Python
dataset = ws.datasets['existing dataset']
dataset.update_from_dataframe(frame2)
print(dataset.data_type_id) # 'GenericCSV'
print(dataset.name) # 'existing dataset'
print(dataset.description) # 'data up to jan 2015'
If you want to serialize the data to a different format, specify a value for the optional
data_type_id parameter.
Python
dataset.update_from_dataframe(
dataframe=frame2,
data_type_id=DataTypeIds.GenericTSV,
)
print(dataset.data_type_id) # 'GenericTSV'
print(dataset.name) # 'existing dataset'
print(dataset.description) # 'data up to jan 2015'
You can optionally set a new description by specifying a value for the description
parameter.
Python
dataset.update_from_dataframe(
dataframe=frame2,
description='data up to feb 2015',
)
print(dataset.data_type_id) # 'GenericCSV'
print(dataset.name) # 'existing dataset'
print(dataset.description) # 'data up to feb 2015'
You can optionally set a new name by specifying a value for the name parameter. From
now on, you'll retrieve the dataset using the new name only. The following code updates
the data, name, and description.
Python
dataset = ws.datasets['existing dataset']
dataset.update_from_dataframe(
dataframe=frame2,
name='existing dataset v2',
description='data up to feb 2015',
)
print(dataset.data_type_id) # 'GenericCSV'
print(dataset.name) # 'existing dataset v2'
print(dataset.description) # 'data up to feb 2015'
The data_type_id , name and description parameters are optional and default to their
previous value. The dataframe parameter is always required.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
Explore and analyze data with Python
Azure ML Package client library for Python - version 1.2.0
Data collection and manipulation
Related resources
Explore data in the Team Data Science Process
Sample data in Azure blob containers, SQL Server, and Hive tables
Create features for data in SQL Server using SQL and Python
What is the Team Data Science Process?
Process Azure Blob Storage data with
advanced analytics
Article • 11/21/2022
This document covers exploring data and generating features from data stored in Azure
Blob Storage.
1. Download the data from Blob Storage with the following sample Python code
using Blob service. Replace the variable in the code below with your specific values:
Python
STORAGEACCOUNTNAME= <storage_account_name>
STORAGEACCOUNTKEY= <storage_account_key>
LOCALFILENAME= <local_file_name>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>
2. Read the data into a Pandas data-frame from the downloaded file.
Python
Now you are ready to explore the data and generate features on this dataset.
Data Exploration
Here are a few examples of ways to explore data using Pandas:
Python
Python
dataframe_blobdata.head(10)
dataframe_blobdata.tail(10)
3. Check the data type each column was imported as using the following sample
code
Python
4. Check the basic stats for the columns in the data set as follows
Python
dataframe_blobdata.describe()
Python
dataframe_blobdata['<column_name>'].value_counts()
6. Count missing values versus the actual number of entries in each column using the
following sample code
Python
miss_num = dataframe_blobdata.shape[0] - dataframe_blobdata.count()
print miss_num
7. If you have missing values for a specific column in the data, you can drop them as
follows:
Python
dataframe_blobdata_noNA = dataframe_blobdata.dropna()
dataframe_blobdata_noNA.shape
Python
dataframe_blobdata_mode =
dataframe_blobdata.fillna({'<column_name>':dataframe_blobdata['<column_
name>'].mode()[0]})
8. Create a histogram plot using variable number of bins to plot the distribution of a
variable:
Python
dataframe_blobdata['<column_name>'].value_counts().plot(kind='bar')
np.log(dataframe_blobdata['<column_name>']+1).hist(bins=50)
Python
Feature Generation
We can generate features using Python as follows:
Indicator value-based Feature Generation
Categorical features can be created as follows:
Python
dataframe_blobdata['<categorical_column>'].value_counts()
Python
Python
Python
Python
Python
dataframe_blobdata_bin_bool = pd.get_dummies(dataframe_blobdata_bin_id,
prefix='<numeric_column>')
3. Finally, Join the dummy variables back to the original data frame
Python
dataframe_blobdata_with_bin_bool =
dataframe_blobdata.join(dataframe_blobdata_bin_bool)
Python
dataframe.to_csv(os.path.join(os.getcwd(),LOCALFILENAME), sep='\t',
encoding='utf-8', index=False)
Python
STORAGEACCOUNTNAME= <storage_account_name>
LOCALFILENAME= <local_file_name>
STORAGEACCOUNTKEY= <storage_account_key>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>
output_blob_service=BlobService(account_name=STORAGEACCOUNTNAME,account
_key=STORAGEACCOUNTKEY)
localfileprocessed = os.path.join(os.getcwd(),LOCALFILENAME) #assuming
file is in current working directory
try:
#perform upload
output_blob_service.put_block_blob_from_path(CONTAINERNAME,BLOBNAME,loc
alfileprocessed)
except:
print ("Something went wrong with uploading blob:"+BLOBNAME)
3. Now the data can be read from the blob using the Azure Machine Learning Import
Data module as shown in the screen below:
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
This walkthrough shows how to use Azure Data Lake to do data exploration and binary
classification tasks on a sample of the NYC taxi trip and fare dataset. The sample shows
you how to predict whether or not a tip is paid by a fare. It walks you through the steps
of the Team Data Science Process, end-to-end, from data acquisition to model training.
Then it shows you how to deploy a web service that publishes the model.
Technologies
These technologies are used in this walkthrough.
Data Lake Analytics is also a key part of Cortana Analytics Suite. It works with Azure
Synapse Analytics, Power BI, and Data Factory. This combination gives you a complete
cloud big data and advanced analytics platform.
This walkthrough begins by describing how to install the prerequisites and resources
that you need to complete the data science process tasks. Then it outlines the data
processing steps using U-SQL and concludes by showing how to use Python and Hive
with Azure Machine Learning studio (classic) to build and deploy the predictive models.
Python
This walkthrough also contains a section that shows how to build and deploy a
predictive model using Python with Azure Machine Learning sStudio. It provides a
Jupyter Notebook with the Python scripts for the steps in this process. The notebook
includes code for some additional feature engineering steps and models construction
such as multiclass classification and regression modeling in addition to the binary
classification model outlined here. The regression task is to predict the amount of the
tip based on other tip features.
Scripts
Only the principal steps are outlined in this walkthrough. You can download the full U-
SQL script and Jupyter Notebook from GitHub .
Prerequisites
Before you begin these topics, you must have the following:
An Azure subscription. If you don't already have one, see Get Azure free trial .
[Recommended] Visual Studio 2013 or later. If you don't already have one of these
versions installed, you can download a free Community version from Visual Studio
Community .
7 Note
Instead of Visual Studio, you can also use the Azure portal to submit Azure Data
Lake queries. Instructions are provided on how to do so both with Visual Studio
and on the portal in the section titled Process data with U-SQL.
This section provides instructions on how to create each of these resources. If you
choose to use Hive tables with Azure Machine Learning, instead of Python, to build a
model, you also need to provision an HDInsight (Hadoop) cluster. This alternative
procedure in described in the Option 2 section.
7 Note
The Azure Data Lake Store can be created either separately or when you create the
Azure Data Lake Analytics as the default storage. Instructions are referenced for
creating each of these resources separately, but the Data Lake storage account
need not be created separately.
The 'trip_data' CSV contains trip details, such as number of passengers, pickup and
dropoff points, trip duration, and trip length. Here are a few sample records:
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropo
ff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup
_latitude,dropoff_longitude,dropoff_latitude
89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013-01-
01 15:11:48,2013-01-01
15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-
06 00:18:35,2013-01-06 00:22:54,1,259,1.50,-74.006683,40.731781,-73.994499,40.75066
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-
05 18:49:41,2013-01-05 18:54:23,1,282,1.10,-74.004707,40.73777,-74.009834,40.726002
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-
07 23:54:15,2013-01-07 23:58:20,2,244,.70,-73.974602,40.759945,-73.984734,40.759388
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-
07 23:25:03,2013-01-07 23:34:24,1,560,2.10,-73.97625,40.748528,-74.002586,40.747868
The 'trip_fare' CSV contains details of the fare paid for each trip, such as payment type,
fare amount, surcharge and taxes, tips and tolls, and the total amount paid. Here are a
few sample records:
15:11:48,CSH,6.5,0,0.5,0,0,7
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-06
00:18:35,CSH,6,0.5,0.5,0,0,7
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-05
18:49:41,CSH,5.5,1,0.5,0,0,7
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07
23:54:15,CSH,5,0.5,0.5,0,0,6
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07
23:25:03,CSH,9.5,0.5,0.5,0,0,10.5
The unique key to join trip_data and trip_fare is composed of the following three fields:
medallion, hack_license and pickup_datetime. The raw CSV files can be accessed from an
Azure Storage blob. The U-SQL script for this join is in the Join trip and fare tables
section.
The U-SQL scripts are described here and provided in a separate file. You can download
the full U-SQL scripts from GitHub .
To execute U-SQL, Open Visual Studio, click File --> New --> Project, choose U-SQL
Project, name and save it to a folder.
7 Note
It's possible to use the Azure Portal to execute U-SQL instead of Visual Studio. You
can navigate to the Azure Data Lake Analytics resource on the portal and submit
queries directly as illustrated in the following figure:
SQL
Since there are headers in the first row, you need to remove the headers and change
column types into appropriate ones. You can either save the processed data to Azure
Data Lake Storage using
swebhdfs://data_lake_storage_name.azuredatalakestorage.net/folder_name/file_name
_ or to Azure Blob storage account using
wasb://container_name@blob_storage_account_name.blob.core.windows.net/blob_na
me.
SQL
Similarly you can read in the fare data sets. Right-click Azure Data Lake Storage, you can
choose to look at your data in Azure portal --> Data Explorer or File Explorer within
Visual Studio.
Data quality checks
After trip and fare tables have been read in, data quality checks can be done in the
following way. The resulting CSV files can be output to Azure Blob storage or Azure
Data Lake Storage.
SQL
@ex_1 =
SELECT
pickup_month,
COUNT(medallion) AS cnt_medallion,
COUNT(DISTINCT(medallion)) AS unique_medallion
FROM @trip2
GROUP BY pickup_month;
OUTPUT @ex_1
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_1.csv"
USING Outputters.Csv();
SQL
SQL
///find those invalid records in terms of pickup_longitude
@ex_3 =
SELECT COUNT(medallion) AS cnt_invalid_pickup_longitude
FROM @trip
WHERE
pickup_longitude <- 90 OR pickup_longitude > 90;
OUTPUT @ex_3
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_3.csv"
USING Outputters.Csv();
SQL
@trip_summary6 =
SELECT
vendor_id,
SUM(missing_medallion) AS medallion_empty,
COUNT(medallion) AS medallion_total,
COUNT(DISTINCT(medallion)) AS medallion_total_unique
FROM @res
GROUP BY vendor_id;
OUTPUT @trip_summary6
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_16.csv"
USING Outputters.Csv();
Data exploration
Do some data exploration with the following scripts to get a better understanding of the
data.
SQL
@ex_4 =
SELECT tipped,
COUNT(*) AS tip_freq
FROM @tip_or_not
GROUP BY tipped;
OUTPUT @ex_4
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_4.csv"
USING Outputters.Csv();
Find the distribution of tip amount with cut-off values: 0, 5, 10, and 20 dollars.
SQL
SQL
SQL
SQL
@model_data_full =
SELECT t.*,
f.payment_type, f.fare_amount, f.surcharge, f.mta_tax, f.tolls_amount,
f.total_amount, f.tip_amount,
(f.tip_amount > 0 ? 1: 0) AS tipped,
(f.tip_amount >20? 4: (f.tip_amount >10? 3:(f.tip_amount >5 ? 2:
(f.tip_amount > 0 ? 1: 0)))) AS tip_class
FROM @trip AS t JOIN @fare AS f
ON (t.medallion == f.medallion AND t.hack_license == f.hack_license AND
t.pickup_datetime == f.pickup_datetime)
WHERE (pickup_longitude != 0 AND dropoff_longitude != 0 );
For each level of passenger count, calculate the number of records, average tip amount,
variance of tip amount, percentage of tipped trips.
SQL
// contingency table
@trip_summary8 =
SELECT passenger_count,
COUNT(*) AS cnt,
AVG(tip_amount) AS avg_tip_amount,
VAR(tip_amount) AS var_tip_amount,
SUM(tipped) AS cnt_tipped,
(float)SUM(tipped)/COUNT(*) AS pct_tipped
FROM @model_data_full
GROUP BY passenger_count;
OUTPUT @trip_summary8
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_17.csv"
USING Outputters.Csv();
Data sampling
First, randomly select 0.1% of the data from the joined table:
SQL
@model_data_random_sample_1_1000 =
SELECT *
FROM @addrownumberres_randomsample
WHERE rownum % 1000 == 0;
OUTPUT @model_data_random_sample_1_1000
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_7_random_1_1000.csv"
USING Outputters.Csv();
SQL
@model_data_stratified_sample_1_1000 =
SELECT *
FROM @addrownumberres_stratifiedsample
WHERE rownum % 1000 == 0;
//// output to blob
OUTPUT @model_data_stratified_sample_1_1000
TO
"wasb://container_name@blob_storage_account_name.blob.core.windows.net/demo_
ex_9_stratified_1_1000.csv"
USING Outputters.Csv();
////output data to ADL
OUTPUT @model_data_stratified_sample_1_1000
TO
"swebhdfs://data_lake_storage_name.azuredatalakestore.net/nyctaxi_folder/dem
o_ex_9_stratified_1_1000.csv"
USING Outputters.Csv();
In the first option, you use the sampled data that has been written to an Azure
Blob (in the Data sampling step above) and use Python to build and deploy
models from Azure Machine Learning.
In the second option, you query the data in Azure Data Lake directly using a Hive
query. This option requires that you create a new HDInsight cluster or use an
existing HDInsight cluster where the Hive tables point to the NY Taxi data in Azure
Data Lake Storage. Both these options are discussed in the following sections.
Python
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import matplotlib.pyplot as plt
from time import time
import pyodbc
import os
from azure.storage.blob import BlobService
import tables
import time
import zipfile
import random
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from __future__ import division
from sklearn import linear_model
from azureml import services
text
CONTAINERNAME = 'test1'
STORAGEACCOUNTNAME = 'XXXXXXXXX'
STORAGEACCOUNTKEY = 'YYYYYYYYYYYYYYYYYYYYYYYYYYYY'
BLOBNAME = 'demo_ex_9_stratified_1_1000_copy.csv'
blob_service =
BlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTK
EY)
Read in as text
text
t1 = time.time()
data =
blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).split("\n")
t2 = time.time()
print(("It takes %s seconds to read in "+BLOBNAME) % (t2 - t1))
text
colnames =
['medallion','hack_license','vendor_id','rate_code','store_and_fwd_flag
','pickup_datetime','dropoff_datetime',
'passenger_count','trip_time_in_secs','trip_distance','pickup_longitude
','pickup_latitude','dropoff_longitude','dropoff_latitude',
'payment_type', 'fare_amount', 'surcharge', 'mta_tax', 'tolls_amount',
'total_amount', 'tip_amount', 'tipped', 'tip_class', 'rownum']
df1 = pd.DataFrame([sub.split(",") for sub in data], columns =
colnames)
text
cols_2_float =
['trip_time_in_secs','pickup_longitude','pickup_latitude','dropoff_long
itude','dropoff_latitude',
'fare_amount',
'surcharge','mta_tax','tolls_amount','total_amount','tip_amount',
'passenger_count','trip_distance'
,'tipped','tip_class','rownum']
for col in cols_2_float:
df1[col] = df1[col].astype(float)
First you need to create dummy variables that can be used in scikit-learn models
Python
df1_payment_type_dummy = pd.get_dummies(df1['payment_type'],
prefix='payment_type_dummy')
df1_vendor_id_dummy = pd.get_dummies(df1['vendor_id'],
prefix='vendor_id_dummy')
Python
X = data.iloc[:,1:]
Y = data.tipped
Python
Python
model = LogisticRegression()
logit_fit = model.fit(X_train, Y_train)
print ('Coefficients: \n', logit_fit.coef_)
Y_train_pred = logit_fit.predict(X_train)
Score testing data set
Python
Y_test_pred = logit_fit.predict(X_test)
Python
#AUC
print metrics.auc(fpr_train,tpr_train)
print metrics.auc(fpr_test,tpr_test)
#Confusion Matrix
print metrics.confusion_matrix(Y_train,Y_train_pred)
print metrics.confusion_matrix(Y_test,Y_test_pred)
Find your workspace credentials from Azure Machine Learning studio (classic)
settings. In Azure Machine Learning studio, click Settings --> Name -->
Authorization Tokens.
Output
workspaceid = 'xxxxxxxxxxxxxxxxxxxxxxxxxxx'
auth_token = 'xxxxxxxxxxxxxxxxxxxxxxxxxxx'
Python
@services.publish(workspaceid, auth_token)
@services.types(trip_distance = float, passenger_count = float,
payment_type_dummy_CRD = float, payment_type_dummy_CSH=float,
payment_type_dummy_DIS = float, payment_type_dummy_NOC = float,
payment_type_dummy_UNK = float, vendor_id_dummy_CMT = float,
vendor_id_dummy_VTS = float)
@services.returns(int) #0, or 1
def predictNYCTAXI(trip_distance, passenger_count,
payment_type_dummy_CRD, payment_type_dummy_CSH,payment_type_dummy_DIS,
payment_type_dummy_NOC, payment_type_dummy_UNK, vendor_id_dummy_CMT,
vendor_id_dummy_VTS ):
inputArray = [trip_distance, passenger_count,
payment_type_dummy_CRD, payment_type_dummy_CSH, payment_type_dummy_DIS,
payment_type_dummy_NOC, payment_type_dummy_UNK, vendor_id_dummy_CMT,
vendor_id_dummy_VTS]
return logit_fit.predict(inputArray)
url = predictNYCTAXI.service.url
api_key = predictNYCTAXI.service.api_key
print url
print api_key
@services.service(url, api_key)
@services.types(trip_distance = float, passenger_count = float,
payment_type_dummy_CRD = float,
payment_type_dummy_CSH=float,payment_type_dummy_DIS = float,
payment_type_dummy_NOC = float, payment_type_dummy_UNK = float,
vendor_id_dummy_CMT = float, vendor_id_dummy_VTS = float)
@services.returns(float)
def NYCTAXIPredictor(trip_distance, passenger_count,
payment_type_dummy_CRD, payment_type_dummy_CSH,payment_type_dummy_DIS,
payment_type_dummy_NOC, payment_type_dummy_UNK, vendor_id_dummy_CMT,
vendor_id_dummy_VTS ):
pass
Call Web service API. Typically, wait 5-10 seconds after the previous step.
Python
NYCTAXIPredictor(1,2,1,0,0,0,0,0,1)
Then click Dashboard next to the Settings button and a window pops up. Click Hive
View in the upper right corner of the page and you should see the Query Editor.
Paste in the following Hive scripts to create a table. The location of data source is in
Azure Data Lake Storage reference in this way:
adl://data_lake_store_name.azuredatalakestore.net:443/folder_name/file_name.
HiveQL
When the query completes, you should see the results like this:
1. Get the data into Azure Machine Learning studio (classic) using the Import Data
module, available in the Data Input and Output section. For more information, see
the Import Data module reference page.
3. Paste the following Hive script in the Hive database query editor
HiveQL
4. Enter the URL of the HDInsight cluster (this URL can be found in the Azure portal),
then enter the Hadoop credentials, the location of the output data, and the Azure
Storage account name/key/container name.
An example of a binary classification experiment reading data from Hive table is shown
in the following figure:
After the experiment is created, click Set Up Web Service --> Predictive Web Service
Run the automatically created scoring experiment, when it finishes, click Deploy Web
Service
The web service dashboard displays shortly:
Summary
By completing this walkthrough, you've created a data science environment for building
scalable end-to-end solutions in Azure Data Lake. This environment was used to analyze
a large public dataset, taking it through the canonical steps of the Data Science Process,
from data acquisition through model training, and then to the deployment of the model
as a web service. U-SQL was used to process, explore, and sample the data. Python and
Hive were used with Azure Machine Learning studio (classic) to build and deploy
predictive models.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
The Team Data Science Process in action: using Azure Synapse Analytics
Overview of the Data Science Process using Spark on Azure HDInsight
Related resources
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Machine learning at scale
Process data in a SQL Server virtual
machine on Azure
Article • 05/30/2023
This document covers how to explore data and generate features for data stored in a
SQL Server VM on Azure. This goal may be completed by data wrangling using SQL or
by using a programming language like Python.
7 Note
The sample SQL statements in this document assume that data is in SQL Server. If it
isn't, refer to the cloud data science process map to learn how to move your data
to SQL Server.
Using SQL
We describe the following data wrangling tasks in this section using SQL:
1. Data Exploration
2. Feature Generation
Data Exploration
Here are a few sample SQL scripts that can be used to explore data stores in SQL Server.
7 Note
For a practical example, you can use the NYC Taxi dataset and refer to the IPNB
titled NYC Data wrangling using IPython Notebook and SQL Server for an end-
to-end walk-through.
Feature Generation
In this section, we describe ways of generating features using SQL:
7 Note
Once you generate additional features, you can either add them as columns to the
existing table or create a new table with the additional features and primary key,
that can be joined with the original table.
SQL
SQL
The sign tells us whether we are north or south, east or west on the globe.
A nonzero hundreds digit tells us that we're using longitude, not latitude!
The tens digit gives a position to about 1,000 kilometers. It gives us useful
information about what continent or ocean we are on.
The units digit (one decimal degree) gives a position up to 111 kilometers (60
nautical miles, about 69 miles). It can tell you roughly what state, country, or region
you're in.
The first decimal place is worth up to 11.1 km: it can distinguish the position of one
large city from a neighboring large city.
The second decimal place is worth up to 1.1 km: it can separate one village from
the next.
The third decimal place is worth up to 110 m: it can identify a large agricultural
field or institutional campus.
The fourth decimal place is worth up to 11 m: it can identify a parcel of land. It is
comparable to the typical accuracy of an uncorrected GPS unit with no
interference.
The fifth decimal place is worth up to 1.1 m: it distinguishes trees from each other.
Accuracy to this level with commercial GPS units can only be achieved with
differential correction.
The sixth decimal place is worth up to 0.11 m: you can use this for laying out
structures in detail, for designing landscapes, building roads. It should be more
than good enough for tracking movements of glaciers and rivers. This can be
achieved by taking painstaking measures with GPS, such as differentially corrected
GPS.
The location information can be featurized as follows, separating out region, location,
and city information. You can also call a REST end point such as Bing Maps API available
at Find a Location by Point to get the region/district information.
SQL
select
<location_columnname>
,round(<location_columnname>,0) as l1
,l2=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 1 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),1,1) else '0' end
,l3=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 2 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),2,1) else '0' end
,l4=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 3 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),3,1) else '0' end
,l5=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 4 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),4,1) else '0' end
,l6=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 5 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),5,1) else '0' end
,l7=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 6 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),6,1) else '0' end
from <tablename>
These location-based features can be further used to generate additional count features
as described earlier.
Tip
You can programmatically insert the records using your language of choice. You
may need to insert the data in chunks to improve write efficiency (for an example of
how to do this using pyodbc, see A HelloWorld sample to access SQLServer with
python ). Another alternative is to insert data in the database using the BCP
utility.
Connecting to Azure Machine Learning
The newly generated feature can be added as a column to an existing table or stored in
a new table and joined with the original table for machine learning. Features can be
generated or accessed if already created, using the Import Data module in Azure
Machine Learning as shown below:
The following connection string format can be used to connect to a SQL Server
database from Python using pyodbc (replace servername, dbname, username, and
password with your specific values):
Python
The Pandas library in Python provides a rich set of data structures and data analysis
tools for data manipulation for Python programming. The code below reads the results
returned from a SQL Server database into a Pandas data frame:
Python
# Query database and load the returned results in pandas data frame
data_frame = pd.read_sql('''select <columnname1>, <columnname2>... from
<tablename>''', conn)
Now you can work with the Pandas data frame as covered in the article Process Azure
Blob data in your data science environment.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
The Azure Machine Learning Algorithm Cheat Sheet helps you choose the right
algorithm from the designer for a predictive analytics model.
Azure Machine Learning has a large library of algorithms from the classification,
recommender systems, clustering, anomaly detection, regression, and text analytics
families. Each is designed to address a different type of machine learning problem.
Download the cheat sheet here: Machine Learning Algorithm Cheat Sheet (11x17 in.)
Download and print the Machine Learning Algorithm Cheat Sheet in tabloid size to keep
it handy and get help choosing an algorithm.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
This suite of topics shows how to use HDInsight Spark to complete common data
science tasks such as data ingestion, feature engineering, modeling, and model
evaluation. The data used is a sample of the 2013 NYC taxi trip and fare dataset. The
models built include logistic and linear regression, random forests, and gradient
boosted trees. The topics also show how to store these models in Azure blob storage
(WASB) and how to score and evaluate their predictive performance. More advanced
topics cover how models can be trained using cross-validation and hyper-parameter
sweeping. This overview topic also references the topics that describe how to set up the
Spark cluster that you need to complete the steps in the walkthroughs provided.
HDInsight Spark
HDInsight Spark is the Azure hosted offering of open-source Spark. It also includes
support for Jupyter PySpark notebooks on the Spark cluster that can run Spark SQL
interactive queries for transforming, filtering, and visualizing data stored in Azure Blobs
(WASB). PySpark is the Python API for Spark. The code snippets that provide the
solutions and show the relevant plots to visualize the data here run in Jupyter
notebooks installed on the Spark clusters. The modeling steps in these topics contain
code that shows how to train, evaluate, save, and consume each type of model.
pySpark-machine-learning-data-science-spark-data-exploration-modeling.ipynb :
Provides information on how to perform data exploration, modeling, and scoring
with several different algorithms.
pySpark-machine-learning-data-science-spark-advanced-data-exploration-
modeling.ipynb : Includes topics in notebook #1, and model development using
hyperparameter tuning and cross-validation.
pySpark-machine-learning-data-science-spark-model-consumption.ipynb :
Shows how to operationalize a saved model using Python on HDInsight clusters.
Spark2.0-pySpark3-machine-learning-data-science-spark-advanced-data-
exploration-modeling.ipynb : This file provides information on how to perform
data exploration, modeling, and scoring in Spark 2.0 clusters using the NYC Taxi
trip and fare data-set described here. This notebook may be a good starting point
for quickly exploring the code we have provided for Spark 2.0. For a more detailed
notebook analyzes the NYC Taxi data, see the next notebook in this list. See the
notes following this list that compares these notebooks.
Spark2.0-pySpark3_NYC_Taxi_Tip_Regression.ipynb : This file shows how to
perform data wrangling (Spark SQL and dataframe operations), exploration,
modeling and scoring using the NYC Taxi trip and fare data-set described here.
Spark2.0-pySpark3_Airline_Departure_Delay_Classification.ipynb : This file shows
how to perform data wrangling (Spark SQL and dataframe operations), exploration,
modeling and scoring using the well-known Airline On-time departure dataset
from 2011 and 2012. We integrated the airline dataset with the airport weather
data (for example, windspeed, temperature, altitude etc.) prior to modeling, so
these weather features can be included in the model.
7 Note
The airline dataset was added to the Spark 2.0 notebooks to better illustrate the
use of classification algorithms. See the following links for information about airline
on-time departure dataset and weather dataset:
7 Note
The Spark 2.0 notebooks on the NYC taxi and airline flight delay data-sets can take
10 mins or more to run (depending on the size of your HDI cluster). The first
notebook in the above list shows many aspects of the data exploration,
visualization and ML model training in a notebook that takes less time to run with
down-sampled NYC data set, in which the taxi and fare files have been pre-joined:
Spark2.0-pySpark3-machine-learning-data-science-spark-advanced-data-
exploration-modeling.ipynb . This notebook takes a much shorter time to finish
(2-3 mins) and may be a good starting point for quickly exploring the code we have
provided for Spark 2.0.
For guidance on the operationalization of a Spark 2.0 model and model consumption
for scoring, see the Spark 1.6 document on consumption for an example outlining the
steps required. To use this example on Spark 2.0, replace the Python code file with this
file .
Prerequisites
The following procedures are related to Spark 1.6. For the Spark 2.0 version, use the
notebooks described and linked to previously.
1. You must have an Azure subscription. If you do not already have one, see Get
Azure free trial .
2. You need a Spark 1.6 cluster to complete this walkthrough. To create one, see the
instructions provided in Get started: create Apache Spark on Azure HDInsight. The
cluster type and version is specified from the Select Cluster Type menu.
7 Note
For a topic that shows how to use Scala rather than Python to complete tasks for an
end-to-end data science process, see the Data Science using Scala with Spark on
Azure.
2 Warning
Billing for HDInsight clusters is prorated per minute, whether you use them or
not. Be sure to delete your cluster after you finish using it. See how to delete
an HDInsight cluster.
1. The 'trip_data' CSV files contain trip details, such as number of passengers, pick up
and dropoff points, trip duration, and trip length. Here are a few sample records:
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,
dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longit
ude,pickup_latitude,dropoff_longitude,dropoff_latitude
89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013
-01-01 15:11:48,2013-01-01
15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013
-01-06 00:18:35,2013-01-06
00:22:54,1,259,1.50,-74.006683,40.731781,-73.994499,40.75066
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013
-01-05 18:49:41,2013-01-05
18:54:23,1,282,1.10,-74.004707,40.73777,-74.009834,40.726002
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013
-01-07 23:54:15,2013-01-07
23:58:20,2,244,.70,-73.974602,40.759945,-73.984734,40.759388
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013
-01-07 23:25:03,2013-01-07
23:34:24,1,560,2.10,-73.97625,40.748528,-74.002586,40.747868
2. The 'trip_fare' CSV files contain details of the fare paid for each trip, such as
payment type, fare amount, surcharge and taxes, tips and tolls, and the total
amount paid. Here are a few sample records:
89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,2013-01-
01 15:11:48,CSH,6.5,0,0.5,0,0,7
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-
06 00:18:35,CSH,6,0.5,0.5,0,0,7
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-
05 18:49:41,CSH,5.5,1,0.5,0,0,7
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-
07 23:54:15,CSH,5,0.5,0.5,0,0,6
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-
07 23:25:03,CSH,9.5,0.5,0.5,0,0,10.5
We have taken a 0.1% sample of these files and joined the trip_data and trip_fare CVS
files into a single dataset to use as the input dataset for this walkthrough. The unique
key to join trip_data and trip_fare is composed of the fields: medallion, hack_licence and
pickup_datetime. Each record of the dataset contains the following attributes
representing a NYC Taxi trip:
surcharge Surcharge
tip_class Tip class (0: $0, 1: $0-5, 2: $6-10, 3: $11-20, 4: > $20)
You see the file name on your Jupyter file list with an Upload button again. Click this
Upload button. Now you have imported the notebook. Repeat these steps to upload the
other notebooks from this walkthrough.
Tip
You can right-click the links on your browser and select Copy Link to get the
GitHub raw content URL. You can paste this URL into the Jupyter Upload file
explorer dialog box.
Tip
The PySpark kernel automatically visualizes the output of SQL (HiveQL) queries. You
are given the option to select among several different types of visualizations (Table,
Pie, Line, Area, or Bar) by using the Type menu buttons in the notebook:
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Machine learning at scale
Data Science using Scala and Spark on
Azure
Article • 11/21/2022
This article shows you how to use Scala for supervised machine learning tasks with the
Spark scalable MLlib and Spark ML packages on an Azure HDInsight Spark cluster. It
walks you through the tasks that constitute the Data Science process: data ingestion and
exploration, visualization, feature engineering, modeling, and model consumption. The
models in the article include logistic and linear regression, random forests, and
gradient-boosted trees (GBTs), in addition to two common supervised machine learning
tasks:
Regression problem: Prediction of the tip amount ($) for a taxi trip
Binary classification: Prediction of tip or no tip (1/0) for a taxi trip
The modeling process requires training and evaluation on a test data set and relevant
accuracy metrics. In this article, you can learn how to store these models in Azure Blob
storage and how to score and evaluate their predictive performance. This article also
covers the more advanced topics of how to optimize models by using cross-validation
and hyper-parameter sweeping. The data used is a sample of the 2013 NYC taxi trip and
fare data set available on GitHub.
Scala , a language based on the Java virtual machine, integrates object-oriented and
functional language concepts. It's a scalable language that is well suited to distributed
processing in the cloud, and runs on Azure Spark clusters.
The setup steps and code in this article are for Azure HDInsight 3.4 Spark 1.6. However,
the code in this article and in the Scala Jupyter Notebook are generic and should work
on any Spark cluster. The cluster setup and management steps might be slightly
different from what is shown in this article if you are not using HDInsight Spark.
7 Note
For a topic that shows you how to use Python rather than Scala to complete tasks
for an end-to-end Data Science process, see Data Science using Spark on Azure
HDInsight.
Prerequisites
You must have an Azure subscription. If you do not already have one, get an Azure
free trial .
You need an Azure HDInsight 3.4 Spark 1.6 cluster to complete the following
procedures. To create a cluster, see the instructions in Get started: Create Apache
Spark on Azure HDInsight. Set the cluster type and version on the Select Cluster
Type menu.
2 Warning
Billing for HDInsight clusters is prorated per minute, whether you use them or
not. Be sure to delete your cluster after you finish using it. See how to delete
an HDInsight cluster.
For a description of the NYC taxi trip data and instructions on how to execute code from
a Jupyter notebook on the Spark cluster, see the relevant sections in Overview of Data
Science using Spark on Azure HDInsight.
Select Scala to see a directory that has a few examples of prepackaged notebooks that
use the PySpark API. The Exploration Modeling and Scoring using Scala.ipynb notebook
that contains the code samples for this suite of Spark topics is available on GitHub .
You can upload the notebook directly from GitHub to the Jupyter Notebook server on
your Spark cluster. On your Jupyter home page, click the Upload button. In the file
explorer, paste the GitHub (raw content) URL of the Scala notebook, and then click
Open. The Scala notebook is available at the following URL:
Exploration-Modeling-and-Scoring-using-Scala.ipynb
The Spark kernels that are provided with Jupyter notebooks have preset contexts. You
don't need to explicitly set the Spark or Hive contexts before you start working with the
application you are developing. The preset contexts are:
sc for SparkContext
sqlContext for HiveContext
Spark magics
The Spark kernel provides some predefined "magics," which are special commands that
you can call with %% . Two of these commands are used in the following code samples.
%%local specifies that the code in subsequent lines will be executed locally. The
code must be valid Scala code.
%%sql -o <variable name> executes a Hive query against sqlContext . If the -o
parameter is passed, the result of the query is persisted in the %%local Scala
context as a Spark data frame.
For more information about the kernels for Jupyter notebooks and their predefined
"magics" that you call with %% (for example, %%local ), see Kernels available for Jupyter
notebooks with HDInsight Spark Linux clusters on HDInsight.
Import libraries
Import the Spark, MLlib, and other libraries you'll need by using the following code.
Scala
# SPECIFY SQLCONTEXT
val sqlContext = new SQLContext(sc)
Data ingestion
The first step in the Data Science process is to ingest the data that you want to analyze.
You bring the data from external sources or systems where it resides into your data
exploration and modeling environment. In this article, the data you ingest is a joined
0.1% sample of the taxi trip and fare file (stored as a .tsv file). The data exploration and
modeling environment is Spark. This section contains the code to complete the
following series of tasks:
To save models or files in Blob storage, you need to properly specify the path. Reference
the default container attached to the Spark cluster by using a path that begins with
wasb:/// . Reference other locations by using wasb:// .
The following code sample specifies the location of the input data to be read and the
path to Blob storage that is attached to the Spark cluster where the model will be saved.
Scala
# CREATE AN INITIAL DATA FRAME AND DROP COLUMNS, AND THEN CREATE A CLEANED
DATA FRAME BY FILTERING FOR UNWANTED VALUES OR OUTLIERS
val taxi_train_df = sqlContext.createDataFrame(taxi_temp, taxi_schema)
val taxi_df_train_cleaned =
(taxi_train_df.drop(taxi_train_df.col("medallion"))
.drop(taxi_train_df.col("hack_license")).drop(taxi_train_df.col("store_and_f
wd_flag"))
.drop(taxi_train_df.col("pickup_datetime")).drop(taxi_train_df.col("dropoff_
datetime"))
.drop(taxi_train_df.col("pickup_longitude")).drop(taxi_train_df.col("pickup_
latitude"))
.drop(taxi_train_df.col("dropoff_longitude")).drop(taxi_train_df.col("dropof
f_latitude"))
.drop(taxi_train_df.col("surcharge")).drop(taxi_train_df.col("mta_tax"))
.drop(taxi_train_df.col("direct_distance")).drop(taxi_train_df.col("tolls_am
ount"))
.drop(taxi_train_df.col("total_amount")).drop(taxi_train_df.col("tip_class")
)
.filter("passenger_count > 0 and passenger_count < 8 AND
payment_type in ('CSH', 'CRD') AND tip_amount >= 0 AND tip_amount < 30 AND
fare_amount >= 1 AND fare_amount < 150 AND trip_distance > 0 AND
trip_distance < 100 AND trip_time_in_secs > 30 AND trip_time_in_secs <
7200"));
Output:
Scala
Output:
which is the head node of the HDInsight cluster. Typically, you use %%local magic
in conjunction with the %%sql magic with the -o parameter. The -o parameter
would persist the output of the SQL query locally, and then %%local magic would
trigger the next set of code snippet to run locally against the output of the SQL
queries that is persisted locally.
Scala
In the following code, the %%local magic creates a local data frame, sqlResults. You can
use sqlResults to plot by using matplotlib.
Tip
Local magic is used multiple times in this article. If your data set is large, please
sample to create a data frame that can fit in local memory.
Scala
The Spark kernel automatically visualizes the output of SQL (HiveQL) queries after you
run the code. You can choose between several types of visualizations:
Table
Pie
Line
Area
Bar
Scala
# RUN THE CODE LOCALLY ON THE JUPYTER SERVER AND IMPORT LIBRARIES
%%local
import matplotlib.pyplot as plt
%matplotlib inline
Output:
Create features and transform features, and
then prep data for input into modeling
functions
For tree-based modeling functions from Spark ML and MLlib, you have to prepare target
and features by using a variety of techniques, such as binning, indexing, one-hot
encoding, and vectorization. Here are the procedures to follow in this section:
Scala
# CACHE THE DATA FRAME IN MEMORY AND MATERIALIZE THE DATA FRAME IN MEMORY
taxi_df_train_with_newFeatures.cache()
taxi_df_train_with_newFeatures.count()
You need to index or encode your models in different ways, depending on the model.
For example, logistic and linear regression models require one-hot encoding. For
example, a feature with three categories can be expanded into three feature columns.
Each column would contain 0 or 1 depending on the category of an observation. MLlib
provides the OneHotEncoder function for one-hot encoding. This encoder maps a
column of label indices to a column of binary vectors with at most a single one-value.
With this encoding, algorithms that expect numerical valued features, such as logistic
regression, can be applied to categorical features.
Here you transform only four variables to show examples, which are character strings.
You also can index other variables, such as weekday, represented by numerical values, as
categorical variables.
For indexing, use StringIndexer() , and for one-hot encoding, use OneHotEncoder()
functions from MLlib. Here is the code to index and encode categorical features:
Scala
Output:
Sample and split the data set into training and test
fractions
This code creates a random sampling of the data (25%, in this example). Although
sampling is not required for this example due to the size of the data set, the article
shows you how you can sample so that you know how to use it for your own problems
when needed. When samples are large, this can save significant time while you train
models. Next, split the sample into a training part (75%, in this example) and a testing
part (25%, in this example) to use in classification and regression modeling.
Add a random number (between 0 and 1) to each row (in a "rand" column) that can be
used to select cross-validation folds during training.
Scala
# SPLIT THE SAMPLED DATA FRAME INTO TRAIN AND TEST, WITH A RANDOM COLUMN
ADDED FOR DOING CROSS-VALIDATION (SHOWN LATER)
# INCLUDE A RANDOM COLUMN FOR CREATING CROSS-VALIDATION FOLDS
val splits = encodedFinalSampled.randomSplit(Array(trainingFraction,
testingFraction), seed = seed)
val trainData = splits(0)
val testData = splits(1)
Output:
In this code, you specify the target (dependent) variable and the features to use to train
models. Then, you create indexed or one-hot encoded training and testing input labeled
point RDDs or data frames.
Scala
# CREATE INDEXED DATA FRAMES THAT YOU CAN USE TO TRAIN BY USING SPARK ML
FUNCTIONS
val indexedTRAINbinaryDF = indexedTRAINbinary.toDF()
val indexedTESTbinaryDF = indexedTESTbinary.toDF()
val indexedTRAINregDF = indexedTRAINreg.toDF()
val indexedTESTregDF = indexedTESTreg.toDF()
# CREATE ONE-HOT ENCODED (VECTORIZED) DATA FRAMES THAT YOU CAN USE TO TRAIN
BY USING SPARK ML FUNCTIONS
val assemblerOneHot = new VectorAssembler().setInputCols(Array("paymentVec",
"vendorVec", "rateVec", "TrafficTimeBinsVec", "pickup_hour", "weekday",
"passenger_count", "trip_time_in_secs", "trip_distance",
"fare_amount")).setOutputCol("features")
val OneHotTRAIN = assemblerOneHot.transform(trainData)
val OneHotTEST = assemblerOneHot.transform(testData)
Output:
Time to run the cell: 4 seconds.
Scala
# CATEGORIZE FEATURES AND BINARIZE THE TARGET FOR THE BINARY CLASSIFICATION
PROBLEM
# TRAIN DATA
val indexer = new
VectorIndexer().setInputCol("features").setOutputCol("featuresCat").setMaxCa
tegories(32)
val indexerModel = indexer.fit(indexedTRAINbinaryDF)
val indexedTrainwithCatFeat = indexerModel.transform(indexedTRAINbinaryDF)
val binarizer: Binarizer = new
Binarizer().setInputCol("label").setOutputCol("labelBin").setThreshold(0.5)
val indexedTRAINwithCatFeatBinTarget =
binarizer.transform(indexedTrainwithCatFeat)
# TEST DATA
val indexerModel = indexer.fit(indexedTESTbinaryDF)
val indexedTrainwithCatFeat = indexerModel.transform(indexedTESTbinaryDF)
val binarizer: Binarizer = new
Binarizer().setInputCol("label").setOutputCol("labelBin").setThreshold(0.5)
val indexedTESTwithCatFeatBinTarget =
binarizer.transform(indexedTrainwithCatFeat)
# TRAIN DATA
val indexer = new
VectorIndexer().setInputCol("features").setOutputCol("featuresCat").setMaxCa
tegories(32)
val indexerModel = indexer.fit(indexedTRAINregDF)
val indexedTRAINwithCatFeat = indexerModel.transform(indexedTRAINregDF)
# TEST DATA
val indexerModel = indexer.fit(indexedTESTbinaryDF)
val indexedTESTwithCatFeat = indexerModel.transform(indexedTESTregDF)
Scala
Scala
# LOAD THE SAVED MODEL AND SCORE THE TEST DATA SET
val savedModel =
org.apache.spark.ml.classification.LogisticRegressionModel.load(filename)
println(s"Coefficients: ${savedModel.coefficients} Intercept:
${savedModel.intercept}")
Output:
Use Python on local Pandas data frames to plot the ROC curve.
Scala
# QUERY THE RESULTS
%%sql -q -o sqlResults
SELECT tipped, probability from testResults
# RUN THE CODE LOCALLY ON THE JUPYTER SERVER AND IMPORT LIBRARIES
%%local
%matplotlib inline
from sklearn.metrics import roc_curve,auc
Output:
Create a random forest classification model
Next, create a random forest classification model by using the Spark ML
RandomForestClassifier() function, and then evaluate the model on test data.
Scala
Output:
Scala
# TRAIN A GBT CLASSIFICATION MODEL BY USING MLLIB AND A LABELED POINT
# EVALUATE THE MODEL ON TEST INSTANCES AND THE COMPUTE TEST ERROR
val labelAndPreds = indexedTESTbinary.map { point =>
val prediction = gbtModel.predict(point.features)
(point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble /
indexedTRAINbinary.count()
//println("Learned classification GBT model:\n" + gbtModel.toDebugString)
println("Test Error = " + testErr)
# USE BINARY AND MULTICLASS METRICS TO EVALUATE THE MODEL ON THE TEST DATA
val metrics = new MulticlassMetrics(labelAndPreds)
println(s"Precision: ${metrics.precision}")
println(s"Recall: ${metrics.recall}")
println(s"F1 Score: ${metrics.fMeasure}")
# SUMMARIZE THE MODEL OVER THE TRAINING SET AND PRINT METRICS
val trainingSummary = lrModel.summary
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: ${trainingSummary.objectiveHistory.toList}")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")
Output:
Scala
# LOAD A SAVED LINEAR REGRESSION MODEL FROM BLOB STORAGE AND SCORE A TEST
DATA SET
Output:
Next, query the test results as a data frame and use AutoVizWidget and matplotlib to
visualize it.
SQL
The code creates a local data frame from the query output and plots the data. The
%%local magic creates a local data frame, sqlResults , which you can use to plot with
matplotlib.
7 Note
This Spark magic is used multiple times in this article. If the amount of data is large,
you should sample to create a data frame that can fit in local memory.
Scala
# RUN THE CODE LOCALLY ON THE JUPYTER SERVER AND IMPORT LIBRARIES
%%local
sqlResults
%matplotlib inline
import numpy as np
Output:
Gradient-boosted trees (GBTS) are ensembles of decision trees. GBTS trains decision
trees iteratively to minimize a loss function. You can use GBTS for regression and
classification. They can handle categorical features, do not require feature scaling, and
can capture nonlinearities and feature interactions. You also can use them in a
multiclass-classification setting.
Scala
# MAKE PREDICTIONS
val predictions = gbtModel.transform(indexedTESTwithCatFeat)
Output:
Split the data into train and validation sets, optimize the model by using hyper-
parameter sweeping on a training set, and evaluate on a validation set (linear
regression)
Optimize the model by using cross-validation and hyper-parameter sweeping by
using Spark ML's CrossValidator function (binary classification)
Optimize the model by using custom cross-validation and parameter-sweeping
code to use any machine learning function and parameter set (linear regression)
Cross-validation is a technique that assesses how well a model trained on a known set
of data will generalize to predict the features of data sets on which it has not been
trained. The general idea behind this technique is that a model is trained on a data set
of known data, and then the accuracy of its predictions is tested against an independent
data set. A common implementation is to divide a data set into k-folds, and then train
the model in a round-robin fashion on all but one of the folds.
Hyper-parameter optimization is the problem of choosing a set of hyper-parameters
for a learning algorithm, usually with the goal of optimizing a measure of the
algorithm's performance on an independent data set. A hyper-parameter is a value that
you must specify outside the model training procedure. Assumptions about hyper-
parameter values can affect the flexibility and accuracy of the model. Decision trees have
hyper-parameters, for example, such as the desired depth and number of leaves in the
tree. You must set a misclassification penalty term for a support vector machine (SVM).
Scala
# RUN THE TRAIN VALIDATION SPLIT AND CHOOSE THE BEST SET OF PARAMETERS
val model = trainValidationSplit.fit(OneHotTRAINLabeled)
# MAKE PREDICTIONS ON THE TEST DATA BY USING THE MODEL WITH THE COMBINATION
OF PARAMETERS THAT PERFORMS THE BEST
val testResults = model.transform(OneHotTESTLabeled).select("label",
"prediction")
Output:
Scala
# CREATE DATA FRAMES WITH PROPERLY LABELED COLUMNS TO USE WITH THE TRAIN AND
TEST SPLIT
val indexedTRAINwithCatFeatBinTargetRF =
indexedTRAINwithCatFeatBinTarget.select("labelBin","featuresCat").withColumn
Renamed(existingName="labelBin",newName="label").withColumnRenamed(existingN
ame="featuresCat",newName="features")
val indexedTESTwithCatFeatBinTargetRF =
indexedTESTwithCatFeatBinTarget.select("labelBin","featuresCat").withColumnR
enamed(existingName="labelBin",newName="label").withColumnRenamed(existingNa
me="featuresCat",newName="features")
indexedTRAINwithCatFeatBinTargetRF.cache()
indexedTESTwithCatFeatBinTargetRF.cache()
# RUN THE TRAIN VALIDATION SPLIT AND CHOOSE THE BEST SET OF PARAMETERS
val model = CrossValidator.fit(indexedTRAINwithCatFeatBinTargetRF)
# MAKE PREDICTIONS ON THE TEST DATA BY USING THE MODEL WITH THE COMBINATION
OF PARAMETERS THAT PERFORMS THE BEST
val testResults =
model.transform(indexedTESTwithCatFeatBinTargetRF).select("label",
"prediction")
Output:
Scala
val nFolds = 3
val numModels = paramGrid.size
val numParamsinGrid = 2
var maxDepth = -1
var numTrees = -1
var param = ""
var paramval = -1
var validateLB = -1.0
var validateUB = -1.0
val h = 1.0 / nFolds;
val RMSE = Array.fill(numModels)(0.0)
# CREATE K-FOLDS
val splits = MLUtils.kFold(indexedTRAINbinary, numFolds = nFolds, seed=1234)
# LOOP THROUGH K-FOLDS AND THE PARAMETER GRID TO GET AND IDENTIFY THE BEST
PARAMETER SET BY LEVEL OF ACCURACY
for (i <- 0 to (nFolds-1)) {
validateLB = i * h
validateUB = (i + 1) * h
val validationCV = trainData.filter($"rand" >= validateLB && $"rand" <
validateUB)
val trainCV = trainData.filter($"rand" < validateLB || $"rand" >=
validateUB)
val validationLabPt = validationCV.rdd.map(r =>
LabeledPoint(r.getDouble(targetIndRegression(0).toInt),
Vectors.dense(featuresIndIndex.map(r.getDouble(_)).toArray)));
val trainCVLabPt = trainCV.rdd.map(r =>
LabeledPoint(r.getDouble(targetIndRegression(0).toInt),
Vectors.dense(featuresIndIndex.map(r.getDouble(_)).toArray)));
validationLabPt.cache()
trainCVLabPt.cache()
featureSubsetStrategy="auto",impurity="variance", maxBins=32)
val labelAndPreds = validationLabPt.map { point =>
val prediction =
rfModel.predict(point.features)
( prediction, point.label
)
}
val validMetrics = new RegressionMetrics(labelAndPreds)
val rmse = validMetrics.rootMeanSquaredError
RMSE(nParamSets) += rmse
}
validationLabPt.unpersist();
trainCVLabPt.unpersist();
}
val minRMSEindex = RMSE.indexOf(RMSE.min)
# CREATE THE BEST MODEL WITH THE BEST PARAMETERS AND A FULL TRAINING DATA
SET
val best_rfModel = RandomForest.trainRegressor(indexedTRAINreg,
categoricalFeaturesInfo=categoricalFeaturesInfo,
numTrees=best_numTrees,
maxDepth=best_maxDepth,
featureSubsetStrategy="auto",impurity="variance", maxBins=32)
# PREDICT ON THE TRAINING SET WITH THE BEST MODEL AND THEN EVALUATE
val labelAndPreds = indexedTESTreg.map { point =>
val prediction =
best_rfModel.predict(point.features)
( prediction, point.label )
}
Output:
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Next steps
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Machine learning at scale
Feature engineering in machine learning
Article • 01/06/2023
7 Note
This item is under maintenance. We encourage you to use the Azure Machine
Learning designer .
) Important
Support for Machine Learning Studio (classic) will end on 31 August 2024. We
recommend you transition to Azure Machine Learning by that date.
Beginning 1 December 2021, you will not be able to create new Machine Learning
Studio (classic) resources. Through 31 August 2024, you can continue to use the
existing Machine Learning Studio (classic) resources.
ML Studio (classic) documentation is being retired and may not be updated in the
future.
In this article, you learn about feature engineering and its role in enhancing data in
machine learning. Learn from illustrative examples drawn from Azure Machine Learning
Studio (classic) experiments.
Feature engineering: The process of creating new features from raw data to
increase the predictive power of the learning algorithm. Engineered features
should capture additional information that is not easily apparent in the original
feature set.
Feature selection: The process of selecting the key subset of features to reduce the
dimensionality of the training problem.
Normally feature engineering is applied first to generate additional features, and then
feature selection is done to eliminate irrelevant, redundant, or highly correlated
features.
Feature engineering and selection are part of the modeling stage of the Team Data
Science Process (TDSP). To learn more about the TDSP and the data science lifecycle, see
What is the TDSP?
Although many of the raw data fields can be used directly to train a model, it's often
necessary to create additional (engineered) features for an enhanced training dataset.
Engineered features that enhance training provide information that better differentiates
the patterns in the data. But this process is something of an art. Sound and productive
decisions often require domain expertise.
Besides feature set A, which already exists in the original raw data, the other three sets
of features are created through the feature engineering process. Feature set B captures
recent demand for the bikes. Feature set C captures the demand for bikes at a particular
hour. Feature set D captures demand for bikes at particular hour and particular day of
the week. The four training datasets each includes feature set A, A+B, A+B+C, and
A+B+C+D, respectively.
The following figure demonstrates the R script used to create feature set B in the second
left branch.
Results
A comparison of the performance results of the four models is summarized in the
following table:
The best results are shown by features A+B+C. The error rate decreases when additional
feature set are included in the training data. It verifies the presumption that the feature
set B, C provide additional relevant information for the regression task. But adding the D
feature does not seem to provide any additional reduction in the error rate.
Feature hashing
To achieve this task, a technique called feature hashing is applied to efficiently turn
arbitrary text features into indices. Instead of associating each text feature
(words/phrases) to a particular index, this method applies a hash function to the
features and using their hash values as indices directly.
In Studio (classic), there is a Feature Hashing module that creates word/phrase features
conveniently. Following figure shows an example of using this module. The input
dataset contains two columns: the book rating ranging from 1 to 5, and the actual
review content. The goal of this module is to retrieve a bunch of new features that show
the occurrence frequency of the corresponding word(s)/phrase(s) within the particular
book review. To use this module, complete the following steps:
First, select the column that contains the input text ("Col2" in this example).
Second, set the "Hashing bitsize" to 8, which means 2^8=256 features will be
created. The word/phase in all the text will be hashed to 256 indices. The
parameter "Hashing bitsize" ranges from 1 to 31. The word(s)/phrase(s) are less
likely to be hashed into the same index if setting it to be a larger number.
Third, set the parameter "N-grams" to 2. This value gets the occurrence frequency
of unigrams (a feature for every single word) and bigrams (a feature for every pair
of adjacent words) from the input text. The parameter "N-grams" ranges from 0 to
10, which indicates the maximum number of sequential words to be included in a
feature.
The following figure shows what these new feature look like.
Conclusion
Engineered and selected features increase the efficiency of the training process, which
attempts to extract the key information contained in the data. They also improve the
power of these models to classify the input data accurately and to predict outcomes of
interest more robustly.
Feature engineering and selection can also combine to make the learning more
computationally tractable. It does so by enhancing and then reducing the number of
features needed to calibrate or train a model. Mathematically, the selected features are a
minimal set of independent variables that explain the patterns in the data and predict
outcomes successfully.
Principal author:
Next steps
To create features for data in specific environments, see the following articles:
Related resources
Feature selection in the Team Data Science Process
Modeling stage of the Team Data Science Process lifecycle
What is the Team Data Science Process?
Create features for data in SQL Server
using SQL and Python
Article • 05/30/2023
This document shows how to generate features for data stored in a SQL Server VM on
Azure that help algorithms learn more efficiently from the data. You can use SQL or a
programming language like Python to accomplish this task. Both approaches are
demonstrated here.
7 Note
For a practical example, you can consult the NYC Taxi dataset and refer to the
IPNB titled NYC Data wrangling using IPython Notebook and SQL Server for an
end-to-end walk-through.
Prerequisites
This article assumes that you have:
Created an Azure storage account. If you need instructions, see Create an Azure
Storage account
Stored your data in SQL Server. If you have not, see Move data to an Azure SQL
Database for Azure Machine Learning for instructions on how to move the data
there.
7 Note
Once you generate additional features, you can either add them as columns to the
existing table or create a new table with the additional features and primary key,
that can be joined with the original table.
SQL
SQL
Here is a brief primer on latitude and longitude location data from Stack Overflow .
Here are some useful things to understand about location data before creating features
from the field:
The sign indicates whether we are north or south, east or west on the globe.
A nonzero hundreds digit indicates longitude, not latitude is being used.
The tens digit gives a position to about 1,000 kilometers. It gives useful
information about what continent or ocean we are on.
The units digit (one decimal degree) gives a position up to 111 kilometers (60
nautical miles, about 69 miles). It indicates, roughly, what large state or
country/region we are in.
The first decimal place is worth up to 11.1 km: it can distinguish the position of one
large city from a neighboring large city.
The second decimal place is worth up to 1.1 km: it can separate one village from
the next.
The third decimal place is worth up to 110 m: it can identify a large agricultural
field or institutional campus.
The fourth decimal place is worth up to 11 m: it can identify a parcel of land. It is
comparable to the typical accuracy of an uncorrected GPS unit with no
interference.
The fifth decimal place is worth up to 1.1 m: it distinguishes trees from each other.
Accuracy to this level with commercial GPS units can only be achieved with
differential correction.
The sixth decimal place is worth up to 0.11 m: you can use this level for laying out
structures in detail, for designing landscapes, building roads. It should be more
than good enough for tracking movements of glaciers and rivers. This goal can be
achieved by taking painstaking measures with GPS, such as differentially corrected
GPS.
The location information can be featurized by separating out region, location, and city
information. Once can also call a REST endpoint, such as Bing Maps API (see
https://msdn.microsoft.com/library/ff701710.aspx to get the region/district
information).
SQL
select
<location_columnname>
,round(<location_columnname>,0) as l1
,l2=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 1 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),1,1) else '0' end
,l3=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 2 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),2,1) else '0' end
,l4=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 3 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),3,1) else '0' end
,l5=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 4 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),4,1) else '0' end
,l6=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 5 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),5,1) else '0' end
,l7=case when LEN (PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1)) >= 6 then
substring(PARSENAME(round(ABS(<location_columnname>) -
FLOOR(ABS(<location_columnname>)),6),1),6,1) else '0' end
from <tablename>
These location-based features can be further used to generate additional count features
as described earlier.
Tip
You can programmatically insert the records using your language of choice. You
may need to insert the data in chunks to improve write efficiency. Here is an
example of how to do this using pyodbc . Another alternative is to insert data in
the database using BCP utility
The following connection string format can be used to connect to a SQL Server
database from Python using pyodbc (replace servername, dbname, username, and
password with your specific values):
Python
The Pandas library in Python provides a rich set of data structures and data analysis
tools for data manipulation for Python programming. The following code reads the
results returned from a SQL Server database into a Pandas data frame:
Python
# Query database and load the returned results in pandas data frame
data_frame = pd.read_sql('''select <columnname1>, <columnname2>... from
<tablename>''', conn)
Now you may work with the Pandas data frame as covered in topics Create features for
Azure blob storage data using Panda.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
This document shows how to create features for data stored in an Azure HDInsight
Hadoop cluster using Hive queries. These Hive queries use embedded Hive User-
Defined Functions (UDFs), the scripts for which are provided.
The operations needed to create features can be memory intensive. The performance of
Hive queries becomes more critical in such cases and can be improved by tuning certain
parameters. The tuning of these parameters is discussed in the final section.
Examples of the queries that are presented are specific to the NYC Taxi Trip Data
scenarios are also provided in GitHub repository . These queries already have data
schema specified and are ready to be submitted to run. In the final section, parameters
that users can tune so that the performance of Hive queries can be improved are also
discussed.
Prerequisites
This article assumes that you have:
Created an Azure storage account. If you need instructions, see Create an Azure
Storage account
Provisioned a customized Hadoop cluster with the HDInsight service. If you need
instructions, see Customize Azure HDInsight Hadoop Clusters for Advanced
Analytics.
The data has been uploaded to Hive tables in Azure HDInsight Hadoop clusters. If
it has not, follow Create and load data to Hive tables to upload data to Hive tables
first.
Enabled remote access to the cluster. If you need instructions, see Access the Head
Node of Hadoop Cluster.
Feature generation
In this section, several examples of the ways in which features can be generating using
Hive queries are described. Once you have generated additional features, you can either
add them as columns to the existing table or create a new table with the additional
features and primary key, which can then be joined with the original table. Here are the
examples presented:
HiveQL
select
a.<column_name1>, a.<column_name2>, a.sub_count/sum(a.sub_count) over ()
as frequency
from
(
select
<column_name1>,<column_name2>, count(*) as sub_count
from <databasename>.<tablename> group by <column_name1>, <column_name2>
)a
order by frequency desc;
HiveQL
set smooth_param1=1;
set smooth_param2=20;
select
<column_name1>,<column_name2>,
ln((sum_target+${hiveconf:smooth_param1})/(record_count-
sum_target+${hiveconf:smooth_param2}-${hiveconf:smooth_param1})) as risk
from
(
select
<column_nam1>, <column_name2>, sum(binary_target) as sum_target,
sum(1) as record_count
from
(
select
<column_name1>, <column_name2>, if(target_column>0,1,0) as
binary_target
from <databasename>.<tablename>
)a
group by <column_name1>, <column_name2>
)b
In this example, variables smooth_param1 and smooth_param2 are set to smooth the risk
values calculated from the data. Risks have a range between -Inf and Inf. A risk > 0
indicates that the probability that the target is equal to 1 is greater than 0.5.
After the risk table is calculated, users can assign risk values to a table by joining it with
the risk table. The Hive joining query was provided in previous section.
HiveQL
This Hive query assumes that the <datetime field> is in the default datetime format.
If a datetime field is not in the default format, you need to convert the datetime field
into Unix time stamp first, and then convert the Unix time stamp to a datetime string
that is in the default format. When the datetime is in default format, users can apply the
embedded datetime UDFs to extract features.
HiveQL
HiveQL
The hivesampletable in this query comes preinstalled on all Azure HDInsight Hadoop
clusters by default when the clusters are provisioned.
HiveQL
The fields that are used in this query are the GPS coordinates of pickup and dropoff
locations, named pickup_longitude, pickup_latitude, dropoff_longitude, and
dropoff_latitude. The queries that calculate the direct distance between the pickup and
dropoff coordinates are:
HiveQL
set R=3959;
set pi=radians(180);
select pickup_longitude, pickup_latitude, dropoff_longitude,
dropoff_latitude,
${hiveconf:R}*2*2*atan((1-sqrt(1-pow(sin((dropoff_latitude-
pickup_latitude)
*${hiveconf:pi}/180/2),2)-cos(pickup_latitude*${hiveconf:pi}/180)
*cos(dropoff_latitude*${hiveconf:pi}/180)*pow(sin((dropoff_longitude-
pickup_longitude)*${hiveconf:pi}/180/2),2)))
/sqrt(pow(sin((dropoff_latitude-
pickup_latitude)*${hiveconf:pi}/180/2),2)
+cos(pickup_latitude*${hiveconf:pi}/180)*cos(dropoff_latitude*${hiveconf:pi}
/180)*
pow(sin((dropoff_longitude-pickup_longitude)*${hiveconf:pi}/180/2),2)))
as direct_distance
from nyctaxi.trip
where pickup_longitude between -90 and 0
and pickup_latitude between 30 and 90
and dropoff_longitude between -90 and 0
and dropoff_latitude between 30 and 90
limit 10;
The mathematical equations that calculate the distance between two GPS coordinates
can be found on the Movable Type Scripts site, authored by Peter Lapisu. In this
Javascript, the function toRad() is just lat_or_lonpi/180, which converts degrees to
radians. Here, lat_or_lon is the latitude or longitude. Since Hive does not provide the
function atan2 , but provides the function atan , the atan2 function is implemented by
atan function in the above Hive query using the definition provided in Wikipedia .
A full list of Hive embedded UDFs can be found in the Built-in Functions section on the
Apache Hive wiki ).
1. Java heap space: For queries involving joining large datasets, or processing long
records, running out of heap space is one of the common errors. This error can be
avoided by setting parameters mapreduce.map.java.opts and
mapreduce.task.io.sort.mb to desired values. Here is an example:
HiveQL
set mapreduce.map.java.opts=-Xmx4096m;
set mapreduce.task.io.sort.mb=-Xmx1024m;
This parameter allocates 4-GB memory to Java heap space and also makes sorting
more efficient by allocating more memory for it. It is a good idea to play with these
allocations if there are any job failure errors related to heap space.
2. DFS block size: This parameter sets the smallest unit of data that the file system
stores. As an example, if the DFS block size is 128 MB, then any data of size less
than and up to 128 MB is stored in a single block. Data that is larger than 128 MB
is allotted extra blocks.
3. Choosing a small block size causes large overheads in Hadoop since the name
node has to process many more requests to find the relevant block pertaining to
the file. A recommended setting when dealing with gigabytes (or larger) data is:
HiveQL
set dfs.block.size=128m;
HiveQL
set hive.auto.convert.join=true;
5. Specifying the number of mappers to Hive: While Hadoop allows the user to set
the number of reducers, the number of mappers is typically not be set by the user.
A trick that allows some degree of control on this number is to choose the Hadoop
variables mapred.min.split.size and mapred.max.split.size as the size of each map
task is determined by:
HiveQL
mapred.min.split.size is 0, that of
mapred.max.split.size is Long.MAX and that of
dfs.block.size is 64 MB.
As we can see, given the data size, tuning these parameters by "setting" them
allows us to tune the number of mappers used.
6. Here are a few other more advanced options for optimizing Hive performance.
These options allow you to set the memory allocated to map and reduce tasks, and
can be useful in tweaking performance. Keep in mind that the
mapreduce.reduce.memory.mb cannot be greater than the physical memory size of
each worker node in the Hadoop cluster.
HiveQL
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
This article explains the purposes of feature selection and provides examples of its role
in the data enhancement process of machine learning. These examples are drawn from
Azure Machine Learning Studio.
The engineering and selection of features is one part of the Team Data Science Process
(TDSP) outlined in the article What is the Team Data Science Process?. Feature
engineering and selection are parts of the Develop features step of the TDSP.
Normally feature engineering is applied first to generate additional features, and then
the feature selection step is performed to eliminate irrelevant, redundant, or highly
correlated features.
Although feature selection does seek to reduce the number of features in the dataset
used to train the model, it is not referred to by the term "dimensionality reduction".
Feature selection methods extract a subset of original features in the data without
changing them. Dimensionality reduction methods employ engineered features that can
transform the original features and thus modify them. Examples of dimensionality
reduction methods include principal component analysis (PCA), canonical correlation
analysis, and singular value decomposition (SVD).
Among others, one widely applied category of feature selection methods in a supervised
context is called "filter-based feature selection". By evaluating the correlation between
each feature and the target attribute, these methods apply a statistical measure to
assign a score to each feature. The features are then ranked by the score, which may be
used to help set the threshold for keeping or eliminating a specific feature. Examples of
statistical measures used in these methods include Pearson correlation coefficient (PCC),
mutual information (MI), and the chi-squared test.
The Filter Based Feature Selection component in Azure Machine Learning designer helps
you identify the columns in your input dataset that have the greatest predictive power.
Conclusion
Feature engineering and feature selection are two commonly engineering techniques to
increase training efficiency. These techniques also improve the model's power to classify
the input data accurately and to predict outcomes of interest more robustly. Feature
engineering and selection can also combine to make the learning more computationally
efficient by enhancing and then reducing the number of features needed to calibrate or
train a model. Mathematically speaking, the features selected to train the model are a
minimal set of independent variables that explain the maximum variance in the data to
predict the outcome feature.
Principal author:
Production platforms
There are various approaches and platforms to put models into production. Here are a
few options:
7 Note
Prior to deployment, one has to ensure the latency of model scoring is low enough
to use in production.
7 Note
For deployment from Azure Machine Learning, see Deploy machine learning
models to Azure.
A/B testing
When multiple models are in production, A/B testing may be used to compare model
performance.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Mark Tabladillo | Senior Cloud Solution Architect
Next steps
What is the Team Data Science Process?
Compare the machine learning products and technologies from Microsoft
Machine learning at scale
Build and deploy a model using Azure
Synapse Analytics
Article • 05/30/2023
In this tutorial, we walk you through building and deploying a machine learning model
using Azure Synapse Analytics for a publicly available dataset -- the NYC Taxi Trips
dataset. The binary classification model constructed predicts whether or not a tip is paid for
a trip. Models include multiclass classification (whether or not there is a tip) and regression
(the distribution for the tip amounts paid).
The procedure follows the Team Data Science Process (TDSP) workflow. We show how to set
up a data science environment, how to load the data into Azure Synapse Analytics, and how
to use either Azure Synapse Analytics or a Jupyter Notebook to explore the data and
engineer features to model. We then show how to build and deploy a model with Azure
Machine Learning.
1. The trip_data.csv file contains trip details, such as number of passengers, pickup and
dropoff points, trip duration, and trip length. Here are a few sample records:
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_
datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latit
ude,dropoff_longitude,dropoff_latitude
89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013-01-01
15:11:48,2013-01-01 15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-06
00:18:35,2013-01-06 00:22:54,1,259,1.50,-74.006683,40.731781,-73.994499,40.75066
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-05
18:49:41,2013-01-05 18:54:23,1,282,1.10,-74.004707,40.73777,-74.009834,40.726002
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07
23:54:15,2013-01-07 23:58:20,2,244,.70,-73.974602,40.759945,-73.984734,40.759388
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07
23:25:03,2013-01-07 23:34:24,1,560,2.10,-73.97625,40.748528,-74.002586,40.747868
2. The trip_fare.csv file contains details of the fare paid for each trip, such as payment
type, fare amount, surcharge and taxes, tips and tolls, and the total amount paid. Here
are a few sample records:
89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,2013-01-01
15:11:48,CSH,6.5,0,0.5,0,0,7
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-06
00:18:35,CSH,6,0.5,0.5,0,0,7
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-05
18:49:41,CSH,5.5,1,0.5,0,0,7
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07
23:54:15,CSH,5,0.5,0.5,0,0,6
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07
23:25:03,CSH,9.5,0.5,0.5,0,0,10.5
The unique key used to join trip_data and trip_fare is composed of the following three
fields:
medallion,
hack_license and
pickup_datetime.
1. Binary classification: To predict whether or not a tip was paid for a trip, that is, a
tip_amount that is greater than $0 is a positive example, while a tip_amount of $0 is a
negative example.
2. Multiclass classification: To predict the range of tip paid for the trip. We divide the
tip_amount into five bins or classes:
Class 0 : tip_amount = $0
Class 1 : tip_amount > $0 and tip_amount <= $5
When you provision your own Azure blob storage, choose a geo-location for your
Azure blob storage in or as close as possible to South Central US, which is where the
NYC Taxi data is stored. The data will be copied using AzCopy from the public blob
storage container to a container in your own storage account. The closer your Azure
blob storage is to South Central US, the faster this task (Step 4) will be completed.
To create your own Azure Storage account, follow the steps outlined at About Azure
Storage accounts. Be sure to make notes on the values for following storage account
credentials as they will be needed later in this walkthrough.
Storage Account Name
Storage Account Key
Container Name (which you want the data to be stored in the Azure blob storage)
Provision your Azure Synapse Analytics instance. Follow the documentation at Create and
query an Azure Synapse Analytics in the Azure portal to provision an Azure Synapse
Analytics instance. Make sure that you make notations on the following Azure Synapse
Analytics credentials that will be used in later steps.
Install Visual Studio and SQL Server Data Tools. For instructions, see Getting started with
Visual Studio 2019 for Azure Synapse Analytics.
Connect to your Azure Synapse Analytics with Visual Studio. For instructions, see steps 1
& 2 in Connect to SQL Analytics in Azure Synapse Analytics.
7 Note
Run the following SQL query on the database you created in your Azure Synapse
Analytics (instead of the query provided in step 3 of the connect topic,) to create a
master key.
SQL
BEGIN TRY
--Try to create the master key
CREATE MASTER KEY
END TRY
BEGIN CATCH
--If the master key exists, do nothing
END CATCH;
Create an Azure Machine Learning workspace under your Azure subscription. For
instructions, see Create an Azure Machine Learning workspace.
7 Note
You might need to Run as Administrator when executing the following PowerShell
script if your DestDir directory needs Administrator privilege to create or to write to it.
Azure PowerShell
$source = "https://raw.githubusercontent.com/Azure/Azure-MachineLearning-
DataScience/master/Misc/SQLDW/Download_Scripts_SQLDW_Walkthrough.ps1"
$ps1_dest = "$pwd\Download_Scripts_SQLDW_Walkthrough.ps1"
$wc = New-Object System.Net.WebClient
$wc.DownloadFile($source, $ps1_dest)
.\Download_Scripts_SQLDW_Walkthrough.ps1 –DestDir 'C:\tempSQLDW'
After successful execution, your current working directory changes to -DestDir. You should
be able to see screen like below:
In your -DestDir, execute the following PowerShell script in administrator mode:
Azure PowerShell
./SQLDW_Data_Import.ps1
When the PowerShell script runs for the first time, you will be asked to input the
information from your Azure Synapse Analytics and your Azure blob storage account. When
this PowerShell script completes running for the first time, the credentials you input will
have been written to a configuration file SQLDW.conf in the present working directory. The
future run of this PowerShell script file has the option to read all needed parameters from
this configuration file. If you need to change some parameters, you can choose to input the
parameters on the screen upon prompt by deleting this configuration file and inputting the
parameters values as prompted or to change the parameter values by editing the
SQLDW.conf file in your -DestDir directory.
7 Note
In order to avoid schema name conflicts with those that already exist in your Azure
Azure Synapse Analytics, when reading parameters directly from the SQLDW.conf file, a
3-digit random number is added to the schema name from the SQLDW.conf file as the
default schema name for each run. The PowerShell script may prompt you for a
schema name: the name may be specified at user discretion.
Azure PowerShell
$AzCopy_path = SearchAzCopy
if ($AzCopy_path -eq $null){
Write-Host "AzCopy.exe is not found in C:\Program Files*. Now,
start installing AzCopy..." -ForegroundColor "Yellow"
InstallAzCopy
$AzCopy_path = SearchAzCopy
}
$env_path = $env:Path
for ($i=0; $i -lt $AzCopy_path.count; $i++){
if ($AzCopy_path.count -eq 1){
$AzCopy_path_i = $AzCopy_path
} else {
$AzCopy_path_i = $AzCopy_path[$i]
}
if ($env_path -notlike '*' +$AzCopy_path_i+'*'){
Write-Host $AzCopy_path_i 'not in system path, add it...'
[Environment]::SetEnvironmentVariable("Path",
"$AzCopy_path_i;$env_path", "Machine")
$env:Path =
[System.Environment]::GetEnvironmentVariable("Path","Machine")
$env_path = $env:Path
}
Copies data to your private blob storage account from the public blob with AzCopy
Azure PowerShell
Create a schema
SQL
SQL
Create an external file format for a csv file. Data is uncompressed and fields are
separated with the pipe character.
SQL
Create external fare and trip tables for NYC taxi dataset in Azure blob storage.
SQL
Load data from external tables in Azure blob storage to Azure Synapse Analytics
SQL
Create a sample data table (NYCTaxi_Sample) and insert data to it from selecting
SQL queries on the trip and fare tables. (Some steps of this walkthrough need to
use this sample table.)
SQL
7 Note
Depending on the geographical location of your private blob storage account, the
process of copying data from a public blob to your private storage account can take
about 15 minutes, or even longer,and the process of loading data from your storage
account to your Azure Azure Synapse Analytics could take 20 minutes or longer.
You will have to decide what do if you have duplicate source and destination files.
7 Note
If the .csv files to be copied from the public blob storage to your private blob storage
account already exist in your private blob storage account, AzCopy will ask you
whether you want to overwrite them. If you do not want to overwrite them, input n
when prompted. If you want to overwrite all of them, input a when prompted. You can
also input y to overwrite .csv files individually.
You can use your own data. If your data is in your on-premises machine in your real life
application, you can still use AzCopy to upload on-premises data to your private Azure blob
storage. You only need to change the Source location, $Source =
"http://getgoing.blob.core.windows.net/public/nyctaxidataset" , in the AzCopy command
of the PowerShell script file to the local directory that contains your data.
Tip
If your data is already in your private Azure blob storage in your real life application,
you can skip the AzCopy step in the PowerShell script and directly upload the data to
Azure Azure Synapse Analytics. This will require additional edits of the script to tailor it
to the format of your data.
This PowerShell script also plugs in the Azure Synapse Analytics information into the data
exploration example files SQLDW_Explorations.sql, SQLDW_Explorations.ipynb, and
SQLDW_Explorations_Scripts.py so that these three files are ready to be tried out instantly
after the PowerShell script completes.
Connect to your Azure Synapse Analytics using Visual Studio with the Azure Synapse
Analytics login name and password and open up the SQL Object Explorer to confirm the
database and tables have been imported. Retrieve the SQLDW_Explorations.sql file.
7 Note
To open a Parallel Data Warehouse (PDW) query editor, use the New Query command
while your PDW is selected in the SQL Object Explorer. The standard SQL query editor
is not supported by PDW.
Here are the types of data exploration and feature generation tasks performed in this
section:
SQL
SQL
SQL
Output: The query should return a table with rows specifying the 13,369 medallions (taxis)
and the number of trips completed in 2013. The last column contains the count of the
number of trips completed.
SQL
SELECT medallion, hack_license, COUNT(*)
FROM <schemaname>.<nyctaxi_fare>
WHERE pickup_datetime BETWEEN '20130101' AND '20130131'
GROUP BY medallion, hack_license
HAVING COUNT(*) > 100
Output: The query should return a table with 13,369 rows specifying the 13,369 car/driver
IDs that have completed more that 100 trips in 2013. The last column contains the count of
the number of trips completed.
SQL
Output: The query returns 837,467 trips that have invalid longitude and/or latitude fields.
SQL
SQL
Output:
tip_class tip_freq
1 82230915
2 6198803
3 1932223
0 82264625
4 85765
SQL
SET QUOTED_IDENTIFIER ON
GO
IF EXISTS (SELECT * FROM sys.objects WHERE type IN ('FN', 'IF') AND name =
'fnCalculateDistance')
DROP FUNCTION fnCalculateDistance
GO
RETURNS float
AS
BEGIN
DECLARE @distance decimal(28, 10)
-- Convert to radians
SET @Lat1 = @Lat1 / 57.2958
SET @Long1 = @Long1 / 57.2958
SET @Lat2 = @Lat2 / 57.2958
SET @Long2 = @Long2 / 57.2958
-- Calculate distance
SET @distance = (SIN(@Lat1) * SIN(@Lat2)) + (COS(@Lat1) * COS(@Lat2) *
COS(@Long2 - @Long1))
--Convert to miles
IF @distance <> 0
BEGIN
SET @distance = 3958.75 * ATAN(SQRT(1 - POWER(@distance, 2)) /
@distance);
END
RETURN @distance
END
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF EXISTS (SELECT * FROM sys.objects WHERE type IN ('FN', 'IF') AND name =
'fnCalculateDistance')
DROP FUNCTION fnCalculateDistance
GO
RETURNS float
AS
BEGIN
DECLARE @distance decimal(28, 10)
-- Convert to radians
SET @Lat1 = @Lat1 / 57.2958
SET @Long1 = @Long1 / 57.2958
SET @Lat2 = @Lat2 / 57.2958
SET @Long2 = @Long2 / 57.2958
-- Calculate distance
SET @distance = (SIN(@Lat1) * SIN(@Lat2)) + (COS(@Lat1) * COS(@Lat2) *
COS(@Long2 - @Long1))
--Convert to miles
IF @distance <> 0
BEGIN
SET @distance = 3958.75 * ATAN(SQRT(1 - POWER(@distance, 2)) /
@distance);
END
RETURN @distance
END
GO
Here is an example to call this function to generate features in your SQL query:
SQL
SQL
When you are ready to proceed to Azure Machine Learning, you may either:
1. Save the final SQL query to extract and sample the data and copy-paste the query
directly into a notebook in Azure Machine Learning, or
2. Persist the sampled and engineered data you plan to use for model building in a new
Azure Synapse Analytics table and access that table through a datastore in Azure
Machine Learning.
Data exploration and feature engineering in the
notebook
In this section, we will perform data exploration and feature generation using both Python
and SQL queries against the Azure Synapse Analytics created earlier. A sample notebook
named SQLDW_Explorations.ipynb and a Python script file
SQLDW_Explorations_Scripts.py have been downloaded to your local directory. They are
also available on GitHub . These two files are identical in Python scripts. The Python script
file is provided to you in case you would like to use Python without a notebook. These two
sample Python files were designed under Python 2.7.
The needed Azure Synapse Analytics information in the sample Jupyter Notebook and the
Python script file downloaded to your local machine has been plugged in by the PowerShell
script previously. They are executable without any modification.
If you have already set up an Azure Machine Learning workspace, you can directly upload
the sample Notebook to the AzureML Notebooks area. For directions on uploading a
notebook, see Run Jupyter Notebooks in your workspace
Note: In order to run the sample Jupyter Notebook or the Python script file, the following
Python packages are needed.
pandas
numpy
matplotlib
pyodbc
PyTables
When building advanced analytical solutions on Azure Machine Learning with large data,
here is the recommended sequence:
The followings are a few data exploration, data visualization, and feature engineering
examples. More data explorations can be found in the sample Jupyter Notebook and the
sample Python script file.
SQL
SERVER_NAME=<server name>
DATABASE_NAME=<database name>
USERID=<user name>
PASSWORD=<password>
DB_DRIVER = <database driver>
SQL
CONNECTION_STRING = 'DRIVER=
{'+DRIVER+'};SERVER='+SERVER_NAME+';DATABASE='+DATABASE_NAME+';UID='+USERID+';P
WD='+PASSWORD
conn = pyodbc.connect(CONNECTION_STRING)
nrows = pd.read_sql('''
SELECT SUM(rows) FROM sys.partitions
WHERE object_id = OBJECT_ID('<schemaname>.<nyctaxi_trip>')
''', conn)
ncols = pd.read_sql('''
SELECT COUNT(*) FROM information_schema.columns
WHERE table_name = ('<nyctaxi_trip>') AND table_schema = ('<schemaname>')
''', conn)
ncols = pd.read_sql('''
SELECT COUNT(*) FROM information_schema.columns
WHERE table_name = ('<nyctaxi_fare>') AND table_schema = ('<schemaname>')
''', conn)
t0 = time.time()
query = '''
SELECT TOP 10000 t.*, f.payment_type, f.fare_amount, f.surcharge,
f.mta_tax,
f.tolls_amount, f.total_amount, f.tip_amount
FROM <schemaname>.<nyctaxi_trip> t, <schemaname>.<nyctaxi_fare> f
WHERE datepart("mi",t.pickup_datetime) = 1
AND t.medallion = f.medallion
AND t.hack_license = f.hack_license
AND t.pickup_datetime = f.pickup_datetime
'''
t1 = time.time()
print 'Time to read the sample table is %f seconds' % (t1-t0)
Time to read the sample table is 14.096495 seconds. Number of rows and columns
retrieved = (1000, 21).
Descriptive statistics
Now you are ready to explore the sampled data. We start with looking at some descriptive
statistics for the trip_distance (or any other fields you choose to specify).
SQL
df1['trip_distance'].describe()
SQL
df1.boxplot(column='trip_distance',return_type='dict')
SQL
fig = plt.figure()
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
df1['trip_distance'].plot(ax=ax1,kind='kde', style='b-')
df1['trip_distance'].hist(ax=ax2, bins=100, color='k')
Visualization: Bar and line plots
In this example, we bin the trip distance into five bins and visualize the binning results.
SQL
We can plot the above bin distribution in a bar or line plot with:
SQL
pd.Series(trip_dist_bin_id).value_counts().plot(kind='bar')
and
SQL
pd.Series(trip_dist_bin_id).value_counts().plot(kind='line')
SQL
plt.scatter(df1['trip_time_in_secs'], df1['trip_distance'])
SQL
plt.scatter(df1['passenger_count'], df1['trip_distance'])
Data exploration on sampled data using SQL queries in
Jupyter notebook
In this section, we explore data distributions using the sampled data that is persisted in the
new table we created above. Similar explorations may be performed using the original
tables.
SQL
SQL
query = '''
SELECT tipped, count(*) AS tip_freq
FROM <schemaname>.<nyctaxi_sample>
GROUP BY tipped
'''
pd.read_sql(query, conn)
Exploration: Tip class distribution
SQL
query = '''
SELECT tip_class, count(*) AS tip_freq
FROM <schemaname>.<nyctaxi_sample>
GROUP BY tip_class
'''
SQL
tip_class_dist['tip_freq'].plot(kind='bar')
SQL
query = '''
SELECT CONVERT(date, dropoff_datetime) AS date, COUNT(*) AS c
FROM <schemaname>.<nyctaxi_sample>
GROUP BY CONVERT(date, dropoff_datetime)
'''
pd.read_sql(query,conn)
SQL
query = '''
SELECT medallion,count(*) AS c
FROM <schemaname>.<nyctaxi_sample>
GROUP BY medallion
'''
pd.read_sql(query,conn)
SQL
SQL
SQL
SQL
1. Binary classification: To predict whether or not a tip was paid for a trip.
2. Multiclass classification: To predict the range of tip paid, according to the previously
defined classes.
3. Regression task: To predict the amount of tip paid for a trip.
To begin the modeling exercise, log in to your Azure Machine Learning workspace. If you
have not yet created a machine learning workspace, see Create the workspace.
1. To get started with Azure Machine Learning, see What is Azure Machine Learning?
2. Sign in to the Azure portal .
3. The Machine Learning Home page provides a wealth of information, videos, tutorials,
links to the Modules Reference, and other resources. For more information about
Azure Machine Learning, see the Azure Machine Learning Documentation Center.
In this exercise, we have already explored and engineered the data in Azure Synapse
Analytics, and decided on the sample size to ingest in Azure Machine Learning. Here is the
procedure to build one or more of the prediction models:
1. Get the data into Azure Machine Learning. For details, see Data ingestion options for
Azure Machine Learning workflows
2. Connect to Synapse Analytics. For details, see Link Azure Synapse Analytics and Azure
Machine Learning workspaces and attach Apache Spark pools
) Important
In the modeling data extraction and sampling query examples provided in previous
sections, all labels for the three modeling exercises are included in the query. An
important (required) step in each of the modeling exercises is to exclude the
unnecessary labels for the other two problems, and any other target leaks. For
example, when using binary classification, use the label tipped and exclude the fields
tip_class, tip_amount, and total_amount. The latter are target leaks since they imply
the tip paid.
Summary
To recap what we have done in this walkthrough tutorial, you have created an Azure data
science environment, worked with a large public dataset, taking it through the Team Data
Science Process, all the way from data acquisition to model training, and then to the
deployment of an Azure Machine Learning web service.
License information
This sample walkthrough and its accompanying scripts and notebook(s) are shared by
Microsoft under the MIT license. Check the LICENSE.txt file in the directory of the sample
code on GitHub for more details.
References
NYC Taxi and Limousine Commission Research and Statistics
Search and query an enterprise
knowledge base by using Azure
OpenAI or Azure Cognitive Search
Azure Blob Storage Azure Cache for Redis Azure Cognitive Search Azure AI services
This article describes how to use Azure OpenAI Service or Azure Cognitive Search to
search documents in your enterprise data and retrieve results to provide a ChatGPT-
style question and answer experience. This solution describes two approaches:
Azure Cognitive Search approach: Use Azure Cognitive Search to search and
retrieve relevant text data based on a user query. This service supports full-text
search, semantic search, vector search, and hybrid search.
7 Note
In Azure Cognitive Search, the semantic search and vector search features are
currently in public preview.
1
Storage Function apps Azure Cache for Azure App 1
User
accounts Redis 2 Service 4
2 4
3
Vectorize Return top k Results passed
Translate Create
Extract text query matching content with prompt
(optional) embeddings
3
Download a Visio file of this architecture.
Dataflow
Documents to be ingested can come from various sources, like files on an FTP server,
email attachments, or web application attachments. These documents can be ingested
to Azure Blob Storage via services like Azure Logic Apps, Azure Functions, or Azure Data
Factory. Data Factory is optimal for transferring bulk data.
Embedding creation:
1. The document is ingested into Blob Storage, and an Azure function is triggered to
extract text from the documents.
3. If the documents are PDFs or images, an Azure function can call Azure AI
Document Intelligence to extract the text. If the document is an Excel, CSV, Word,
or text file, python code can be used to extract the text.
4. The extracted text is then chunked appropriately, and an Azure OpenAI embedding
model is used to convert each chunk to embeddings.
5. These embeddings are persisted to the vector database. This solution uses the
Enterprise tier of Azure Cache for Redis, but any vector database can be used.
2. The Azure OpenAI embedding model is used to convert the query into vector
embeddings.
3. A vector similarity search that uses this query vector in the vector database returns
the top k matching content. The matching content to be retrieved can be set
according to a threshold that’s defined by a similarity measure, like cosine
similarity.
4. The top k retrieved content and the system prompt are sent to the Azure OpenAI
language model, like GPT-3.5 Turbo or GPT-4.
5. The search results are presented as the answer to the search query that was
initiated by the user, or the search results can be used as the grounding data for a
multi-turn conversation scenario.
Architecture: Azure Cognitive Search pull
approach
Index creation Query and retrieval
Pull Query
API
1 2 1
Storage Azure Cognitive Azure App User
accounts Search Service
3 4
AI enrichment skillsets
(optional) Create system prompt Call language model
2 3
Download a Visio file of this architecture.
Index creation:
1. Azure Cognitive Search is used to create a search index of the documents in Blob
Storage. Azure Cognitive Search supports Blob Storage, so the pull model is used
to crawl the content, and the capability is implemented via indexers.
7 Note
Azure Cognitive Search supports other data sources for indexing when using
the pull model. Documents can also be indexed from multiple data sources
and consolidated into a single index.
If vector fields are added to the index schema, which loads the vector data for
indexing, vector search can be enabled by indexing that vector data. Vector data
can be generated via Azure OpenAI embeddings.
Query and retrieval:
2. The query is passed to Azure Cognitive Search via the search documents REST API.
The query type can be simple, which is optimal for full-text search, or full, which is
for advanced query constructs like regular expressions, fuzzy and wild card search,
and proximity search. If the query type is set to semantic, a semantic search is
performed on the documents, and the relevant content is retrieved. Azure
Cognitive Search also supports vector search and hybrid search, which requires the
user query to be converted to vector embeddings.
3. The retrieved content and the system prompt are sent to the Azure OpenAI
language model, like GPT-3.5 Turbo or GPT-4.
4. The search results are presented as the answer to the search query that was
initiated by the user, or the search results can be used as the grounding data for a
multi-turn conversation scenario.
Azure OpenAI
Azure AI Azure AI language model
Translator Document
1 Intelligence 2
Download a Visio file of this architecture.
Index creation:
The query and retrieval in this approach is the same as the pull approach earlier in this
article.
Components
Azure OpenAI provides REST API access to Azure OpenAI's language models
including the GPT-3, Codex, and the embedding model series for content
generation, summarization, semantic search, and natural language-to-code
translation. Access the service by using a REST API, Python SDK, or the web-based
interface in the Azure OpenAI Studio .
Azure Cognitive Search is a cloud service that provides infrastructure, APIs, and
tools for searching. Use Azure Cognitive Search to build search experiences over
private disparate content in web, mobile, and enterprise applications.
Blob Storage is the object storage solution for raw files in this scenario. Blob
Storage supports libraries for various languages, such as .NET, Node.js, and Python.
Applications can access files in Blob Storage via HTTP or HTTPS. Blob Storage
has hot, cool, and archive access tiers to support cost optimization for storing large
amounts of data.
The Enterprise tier of Azure Cache for Redis provides managed Redis Enterprise
modules, like RediSearch, RedisBloom, RedisTimeSeries, and RedisJSON. Vector
fields allow vector similarity search, which supports real-time vector indexing
(brute force algorithm (FLAT) and hierarchical navigable small world algorithm
(HNSW)), real-time vector updates, and k-nearest neighbor search. Azure Cache for
Redis brings a critical low-latency and high-throughput data storage solution to
modern applications.
Alternatives
Depending on your scenario, you can add the following workflows.
To create vectorized data, you can use any embedding model. You can also use the
Azure AI services Vision image retrieval API to vectorize images. This tool is
available in private preview.
Use the Durable Functions extension for Azure Functions as a code-first integration
tool to perform text-processing steps, like reading handwriting, text, and tables,
and processing language to extract entities on data based on the size and scale of
the workload.
You can use any database for persistent storage of the extracted embeddings,
including:
Azure SQL Database
Azure Cosmos DB
Azure Database for PostgreSQL
Azure Database for MySQL
Scenario details
Manual processing is increasingly time-consuming, error-prone, and resource-intensive
due to the sheer volume of documents. Organizations that handle huge volumes of
documents, largely unstructured data of different formats like PDF, Excel, CSV, Word,
PowerPoint, and image formats, face a significant challenge processing scanned and
handwritten documents and forms from their customers.
These documents and forms contain critical information, such as personal details,
medical history, and damage assessment reports, which must be accurately extracted
and processed.
Organizations often already have their own knowledge base of information, which can
be used for answering questions with the most appropriate answer. You can use the
services and pipelines described in these solutions to create a source for search
mechanisms of documents.
Potential use cases
This solution provides value to organizations in industries like pharmaceutical
companies and financial services. It applies to any company that has a large number of
documents with embedded information. This AI-powered end-to-end search solution
can be used to extract meaningful information from the documents based on the user
query to provide a ChatGPT-style question and answer experience.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Next steps
What is Azure AI Document Intelligence?
What is Azure OpenAI?
What is Azure Machine Learning?
Introduction to Blob Storage
What is Azure AI Language?
Introduction to Azure Data Lake Storage Gen2
Azure QnA Maker client library
Create, train, and publish your QnA Maker knowledge base
What is question answering?
Related resources
Query-based document summarization
Automate document identification, classification, and search by using Durable
Functions
Index file content and metadata by using Azure Cognitive Search
AI enrichment with image and text processing
Extract and analyze call center
data
Azure Blob Storage Azure AI Speech Azure AI services Power BI
This article describes how to extract insights from customer conversations at a call
center by using Azure AI services and Azure OpenAI Service. Use these real-time and
post-call analytics to improve call center efficiency and customer satisfaction.
Architecture
Detailed call
history, summaries,
reasons for calling
Download a PowerPoint file of this architecture.
Dataflow
1. A phone call between an agent and a customer is recorded and stored in Azure
Blob Storage. Audio files are uploaded to an Azure Storage account via a
supported method, such as the UI-based tool, Azure Storage Explorer, or a Storage
SDK or API.
For batch mode transcription and personal data detection and redaction, use the
AI services Ingestion Client tool. The Ingestion Client tool uses a no-code approach
for call center transcription.
4. Azure OpenAI is used to process the transcript and extract entities, summarize the
conversation, and analyze sentiments. The processed output is stored in Blob
Storage and then analyzed and visualized by using other services. You can also
store the output in a datastore for keeping track of metadata and for reporting.
Use Azure OpenAI to process the stored transcription information.
Components
Blob Storage is the object storage solution for raw files in this scenario. Blob
Storage supports libraries for languages like .NET, Node.js, and Python.
Applications can access files on Blob Storage via HTTP or HTTPS. Blob Storage has
hot, cool, and archive access tiers for storing large amounts of data, which
optimizes cost.
Azure OpenAI provides access to the Azure OpenAI language models, including
GPT-3, Codex, and the embeddings model series, for content generation,
summarization, semantic search, and natural language-to-code translation. You
can access the service through REST APIs, Python SDK, or the web-based interface
in the Azure OpenAI Studio .
Azure AI Speech is an AI-based API that provides speech capabilities like speech-
to-text, text-to-speech, speech translation, and speaker recognition. This
architecture uses the Azure AI Speech batch transcription functionality.
Alternatives
Depending on your scenario, you can add the following workflows.
Scenario details
This solution uses Azure AI Speech to convert audio into written text. Azure AI Language
redacts sensitive information in the conversation transcription. Azure OpenAI extracts
insights from customer conversation to improve call center efficiency and customer
satisfaction. Use this solution to process transcribed text, recognize and remove
sensitive information, and perform sentiment analysis. Scale the services and the
pipeline to accommodate any volume of recorded data.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that can be used to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Reliability
Reliability ensures your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.
Find the availability service level agreement (SLA) for each component in SLAs for
online services .
To design high-availability applications with Storage accounts, see the
configuration options.
To ensure resiliency of the compute services and datastores in this scenario, use
failure mode for services like Azure Functions and Storage. For more information,
see the resiliency checklist for Azure services.
Back up and recover your Form Recognizer models.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
Implement data protection, identity and access management, and network security
recommendations for Blob Storage, AI services, and Azure OpenAI.
Configure AI services virtual networks.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and
improve operational efficiencies. For more information, see Overview of the cost
optimization pillar.
The total cost of this solution depends on the pricing tier of your services. Factors that
can affect the price of each component are:
Performance efficiency
Performance efficiency is the ability of your workload to meet the demands placed on it
by users in an efficient manner. For more information, see Overview of the performance
efficiency pillar.
When high volumes of data are processed, it can expose performance bottlenecks. To
ensure proper performance efficiency, understand and plan for the scaling options to
use with the AI services autoscale feature.
The batch speech API is designed for high volumes, but other AI services APIs might
have request limits, depending on the subscription tier. Consider containerizing AI
services APIs to avoid slowing down large-volume processing. Containers provide
deployment flexibility in the cloud and on-premises. Mitigate side effects of new version
rollouts by using containers. For more information, see Container support in AI services.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Next steps
What is Azure AI Speech?
What is Azure OpenAI?
What is Azure Machine Learning?
Introduction to Blob Storage
What is Azure AI Language?
Introduction to Azure Data Lake Storage Gen2
What is Power BI?
Ingestion Client with AI services
Post-call transcription and analytics
Related resources
Use a speech-to-text transcription pipeline to analyze recorded conversations
Deploy a custom speech-to-text solution
Create custom language and acoustic models
Deploy a custom speech-to-text solution
Implement logging and
monitoring for Azure OpenAI
models
Azure AI services Azure API Management Azure Monitor Azure Active Directory
This solution provides comprehensive logging and monitoring and enhanced security
for enterprise deployments of the Azure OpenAI Service API. The solution enables
advanced logging capabilities for tracking API usage and performance and robust
security measures to help protect sensitive data and help prevent malicious activity.
Architecture
Workflow
1. Client applications access Azure OpenAI endpoints to perform text generation
(completions) and model training (fine-tuning).
7 Note
Load balancing of stateful operations like model fine-tuning, deployments,
and inference of fine-tuned models isn't supported.
3. Azure API Management enables security controls and auditing and monitoring of
the Azure OpenAI models.
a. In API Management, enhanced-security access is granted via Microsoft Entra
groups with subscription-based access permissions.
b. Auditing is enabled for all interactions with the models via Azure Monitor
request logging.
c. Monitoring provides detailed Azure OpenAI model usage KPIs and metrics,
including prompt information and token statistics for usage traceability.
4. API Management connects to all Azure resources via Azure Private Link. This
configuration provides enhanced security for all traffic via private endpoints and
contains traffic in the private network.
5. Multiple Azure OpenAI instances enable scale-out of API usage to ensure high
availability and disaster recovery for the service.
Components
Application Gateway . Application load balancer to help ensure that all users of
the Azure OpenAI APIs get the fastest response and highest throughput for model
completions.
API Management . API management platform for accessing back-end Azure
OpenAI endpoints. Provides monitoring and logging that's not available natively in
Azure OpenAI.
Azure Virtual Network . Private network infrastructure in the cloud. Provides
network isolation so that all network traffic for models is routed privately to Azure
OpenAI.
Azure OpenAI . Service that hosts models and provides generative model
completion outputs.
Monitor . End-to-end observability for applications. Provides access to
application logs via Kusto Query Language. Also enables dashboard reports and
monitoring and alerting capabilities.
Azure Key Vault . Enhanced-security storage for keys and secrets that are used by
applications.
Azure Storage . Application storage in the cloud. Provides Azure OpenAI with
accessibility to model training artifacts.
Microsoft Entra ID . Enhanced-security identity manager. Enables user
authentication and authorization to the application and to platform services that
support the application. Also provides Group Policy to ensure that the principle of
least privilege is applied to all users.
Alternatives
Azure OpenAI provides native logging and monitoring. You can use this native
functionality to track telemetry of the service, but the default cognitive service logging
doesn't track or record inputs and outputs of the service, like prompts, tokens, and
models. These metrics are especially important for compliance and to ensure that the
service operates as expected. Also, by tracking interactions with the large language
models deployed to Azure OpenAI, you can analyze how your organization is using the
service to identify cost and usage patterns that can help inform decisions on scaling and
resource allocation.
The following table provides a comparison of the metrics provided by the default Azure
OpenAI logging and those provided by this solution.
Request count x x
Latency x x
Model utilization x
Token utilization x x
(input/output)
characters)
Deployment operations x x
Scenario details
Large enterprises that use generative AI models need to implement auditing and
logging of the use of these models to ensure responsible use and corporate compliance.
This solution provides enterprise-level logging and monitoring for all interactions with
AI models to mitigate harmful use of the models and help ensure that security and
compliance standards are met. The solution integrates with existing APIs for Azure
OpenAI with little modification to take advantage of existing code bases. Administrators
can also monitor service usage for reporting.
ApiManagementGatewayLogs
| where OperationId == 'completions_create'
| extend modelkey = substring(parse_json(BackendResponseBody)['model'], 0,
indexof(parse_json(BackendResponseBody)['model'], '-', 0, -1, 2))
| extend model = tostring(parse_json(BackendResponseBody)['model'])
| extend prompttokens = parse_json(parse_json(BackendResponseBody)['usage'])
['prompt_tokens']
| extend completiontokens = parse_json(parse_json(BackendResponseBody)
['usage'])['completion_tokens']
| extend totaltokens = parse_json(parse_json(BackendResponseBody)['usage'])
['total_tokens']
| extend ip = CallerIpAddress
| summarize
sum(todecimal(prompttokens)),
sum(todecimal(completiontokens)),
sum(todecimal(totaltokens)),
avg(todecimal(totaltokens))
by ip, model
Output:
ApiManagementGatewayLogs
| where OperationId == 'completions_create'
| extend model = tostring(parse_json(BackendResponseBody)['model'])
| extend prompttokens = parse_json(parse_json(BackendResponseBody)['usage'])
['prompt_tokens']
| extend prompttext = substring(parse_json(parse_json(BackendResponseBody)
['choices'])[0], 0, 100)
Output:
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework,
which is a set of guiding tenets that you can use to improve the quality of a workload.
For more information, see Microsoft Azure Well-Architected Framework.
Reliability
Reliability ensures that your application can meet the commitments you make to your
customers. For more information, see Overview of the reliability pillar.
This scenario ensures high availability of the large language models for your enterprise
users. The Azure application gateway provides an effective layer-7 application delivery
mechanism to ensure fast and consistent access to applications. You can use API
Management to configure, manage, and monitor access to your models. The inherent
high availability of platform services like Storage, Key Vault, and Virtual Network ensure
high reliability for your application. Finally, multiple instances of Azure OpenAI ensure
service resilience in case of application-level failures. These architecture components can
help you ensure the reliability of your application at enterprise scale.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable
data and systems. For more information, see Overview of the security pillar.
Cost optimization
Cost optimization is about reducing unnecessary expenses and improving operational
efficiencies. For more information, see Overview of the cost optimization pillar.
To help you explore the cost of running this scenario, we've preconfigured all the
services in the Azure pricing calculator. To learn how the pricing would change for your
use case, change the appropriate variables to match your expected traffic.
The following three sample cost profiles provide estimates based on the amount of
traffic. (The estimates assume that a document contains approximately 1,000 tokens.)
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal authors:
Other contributors:
Next steps
Azure OpenAI request form
Best practices for prompt engineering with OpenAI API
Azure OpenAI: Documentation, quickstarts, API reference
Azure-Samples/openai-python-enterprise-logging (GitHub)
Configure Azure Cognitive Services virtual networks
Related resources
Protect APIs with Azure Application Gateway and Azure API Management
Query-based document summarization
AI architecture design
Build language model pipelines
with memory
Bing Web Search Azure Cache for Redis Azure Pipelines
Stay ahead of the competition by being informed and having a deep understanding of
your products and competitor products. An AI/machine learning pipeline helps you
quickly and efficiently gather, analyze, and summarize relevant information. This
architecture includes several powerful Azure OpenAI Service models. These models pair
with the popular open-source LangChain framework that's used to develop applications
that are powered by language models.
7 Note
Some parts in the introduction, components, and workflow of this article were
generated with the help of ChatGPT! Try it for yourself , or try it for your
enterprise.
Architecture
1. Internal company documents for products are imported and converted into
searchable vectors. Product-related documents are collected from departments,
such as sales, marketing, and product development. These documents are then
scanned and converted into text by using optical character recognition (OCR)
technology.
2. A LangChain chunking utility chunks the documents into smaller, more
manageable pieces. Chunking breaks down the text into meaningful phrases or
sentences that can be analyzed separately and improves the accuracy of the
pipeline's search capabilities.
3. The language model converts each chunk into a vectorized embedding.
Embeddings are a type of representation that capture the meaning and context of
the text. By converting each chunk into a vectorized embedding, you can store and
search for documents based on their meaning rather than their raw text. To
prevent loss of context within each document chunk, LangChain provides several
utilities for this text splitting step, like capabilities for sliding windows or specifying
text overlap. Some key features include utilities for tagging chunks with document
metadata, optimizing the document retrieval step, and downstream reference.
4. Create an index in a vector store database to store the raw document text,
embeddings vectors, and metadata. The resulting embeddings are stored in a
vector store database along with the raw text of the document and any relevant
metadata, such as the document's title and source.
After the batch pipeline is complete, the real-time, asynchronous pipeline searches for
relevant information. The following steps are taken:
5. Enter a query and relevant metadata, such as your role in the company or the
business unit that you work in. An embeddings model then converts your query
into a vectorized embedding.
6. The orchestrator language model decomposes your query, or main task, into the
set of subtasks that are required to answer your query. Converting the main task
into a series of simpler subtasks allows the language model to address each task
more accurately, which results in better answers with less tendency for inaccuracy.
7. The resulting embedding and decomposed subtasks are stored in the LangChain
model's memory.
a. Top internal document chunks that are relevant to your query are retrieved from
your internal database. A fast vector search is performed for the top n similar
documents that are stored as vectors in Azure Cache for Redis.
b. In parallel, a web search for similar external products is performed via the
LangChain Bing Search language model plugin with a generated search query
that the orchestrator language model composes. Results are stored in the
external model memory component.
8. The vector store database is queried and returns the top relevant product
information pages (chunks and references). The system queries the vector store
database by using your query embedding and returns the most relevant product
information pages, along with the relevant text chunks and references. The
relevant information is stored in LangChain's model memory.
9. The system uses the information that’s stored in LangChain's model memory to
create a new prompt, which is sent to the orchestrator language model to build a
summary report that’s based on your query, company internal knowledge base,
and external web results.
10. Optionally, the output from the previous step is passed to a moderation filter to
remove unwanted information. The final competitive product report is passed to
you.
Components
Azure OpenAI Service provides REST API access to OpenAI's powerful language
models, including the GPT-3, GPT-3.5, GPT-4, and embeddings model series. You
can easily adapt these models to your specific task, such as content generation,
summarization, semantic search, converting text to semantically powerful
embeddings vectors, and natural-language-to-code translation.
Scenario details
This architecture uses an AI/machine learning pipeline, LangChain, and language models
to create a comprehensive analysis of how your product compares to similar competitor
products. The pipeline consists of two main components: a batch pipeline and a real-
time, asynchronous pipeline. When you send a query to the real-time pipeline, the
orchestrator language model, often GPT-4 or the most powerful available language
model, derives a set of tasks to answer your question. These subtasks invoke other
language models and APIs to mine the internal company product database and the
public internet to build a report that shows the competitive position of your products
versus the competitor products.
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Other contributor:
Related resources
AI architecture design
Batch processing
Types of language API services
Query-based document summarization
Article • 08/31/2023
This guide shows how to perform document summarization by using the Azure OpenAI
GPT-3 model. It describes concepts that are related to the document summarization
process, approaches to the process, and recommendations on which model to use for
specific use cases. Finally, it presents two use cases, together with sample code snippets,
to help you understand key concepts.
Architecture
The following diagram shows how a user query fetches relevant data. The summarizer
uses GPT-3 to generate a summary of the text of the most relevant document. In this
architecture, the GPT-3 endpoint is used to summarize the text.
Workflow
This workflow occurs in near-real time.
Scenario details
Enterprises frequently create and maintain a knowledge base about business processes,
customers, products, and information. However, returning relevant content based on a
user query of a large dataset is often challenging. The user can query the knowledge
base and find an applicable document by using methods like page rank, but delving
further into the document to search for relevant information typically becomes a manual
task that takes time. However, with recent advances in foundation transformer models
like the one developed by OpenAI, the query mechanism has been refined by semantic
search methods that use encoding information like embeddings to find relevant
information. These developments enable the ability to summarize content and present it
to the user in a concise and succinct way.
Some benefits of using a summarization service for any use case are:
In-context learning
Azure OpenAI Service uses a generative completion model. The model uses natural
language instructions to identify the requested task and the skill required, a process
known as prompt engineering. When you use this approach, the first part of the prompt
includes natural language instructions and/or examples of the desired task. The model
completes the task by predicting the most probable next text. This technique is known
as in-context learning.
With in-context learning, language models can learn tasks from just a few examples. The
language model is provided with a prompt that contains a list of input-output pairs that
demonstrate a task, and then with a test input. The model makes a prediction by
conditioning on the prompt and predicting the next tokens.
There are three main approaches to in-context learning: zero-shot learning, few-shot
learning, and fine-tuning methods that change and improve the output. These
approaches vary based on the amount of task-specific data that's provided to the
model.
Zero-shot: In this approach, no examples are provided to the model. Only the task
request is provided as input. In zero-shot learning, the model depends on previously
trained concepts. It responds based only on data that it's trained on. It doesn't
necessarily understand the semantic meaning, but it has a statistic understanding that's
based on everything that it's learned from the internet about what should be generated
next. The model attempts to relate the given task to existing categories that it has
already learned about and responds accordingly.
Few-shot: In this approach, several examples that demonstrate the expected answer
format and content are included in the call prompt. The model is provided with a very
small training dataset to guide its predictions. Training with a small set of examples
enables the model to generalize and understand unrelated but previously unseen tasks.
Creating few-shot examples can be challenging because you need to accurately
articulate the task that you want the model to perform. One commonly observed
problem is that models are sensitive to the writing style that's used in the training
examples, especially small models.
When you create a GPT-3 solution, the main effort is in the design and content of the
training prompt.
Prompt engineering
Prompt engineering is a natural language processing discipline that involves discovering
inputs that yield desirable or useful outputs. When a user prompts the system, the way
the content is expressed can dramatically change the output. Prompt design is the most
significant process for ensuring that the GPT-3 model provides a desirable and
contextual response.
The architecture described in this article uses the completions endpoint for
summarization. The completions endpoint is an Azure Cognitive Services API that
accepts a partial prompt or context as input and returns one or more outputs that
continue or complete the input text. A user provides input text as a prompt, and the
model generates text that attempts to match the context or pattern that's provided.
Prompt design is highly dependent on the task and data. Incorporating prompt
engineering into a fine-tuning dataset and investigating what works best before using
the system in production requires significant time and effort.
Prompt design
GPT-3 models can perform multiple tasks, so you need to be explicit in the goals of the
design. The models estimate the desired output based on the provided prompt.
For example, if you input the words "Give me a list of cat breeds," the model doesn't
automatically assume that you're asking for a list of cat breeds. You could be asking the
model to continue a conversation in which the first words are "Give me a list of cat
breeds" and the next ones are "and I'll tell you which ones I like." If the model just
assumed that you wanted a list of cats, it wouldn't be as good at content creation,
classification, or other tasks.
As described in Learn how to generate or manipulate text, there are three basic
guidelines for creating prompts:
Show and tell. Improve the clarity about what you want by providing instructions,
examples, or a combination of the two. If you want the model to rank a list of
items in alphabetical order or to classify a paragraph by sentiment, show it that
that's what you want.
Provide quality data. If you're building a classifier or want a model to follow a
pattern, be sure to provide enough examples. You should also proofread your
examples. The model can usually recognize spelling mistakes and return a
response, but it might assume misspellings are intentional, which can affect the
response.
Check your settings. The temperature and top_p settings control how
deterministic the model is in generating a response. If you ask it for a response
that has only one right answer, configure these settings at a lower level. If you
want more diverse responses, you might want to configure the settings at a higher
level. A common error is to assume that these settings are "cleverness" or
"creativity" controls.
Alternatives
Azure conversational language understanding is an alternative to the summarizer used
here. The main purpose of conversational language understanding is to build models
that predict the overall intention of an incoming utterance, extract valuable information
from it, and produce a response that aligns with the topic. It's useful in chatbot
applications when it can refer to an existing knowledge base to find the suggestion that
best corresponds to the incoming utterance. It doesn't help much when the input text
doesn't require a response. The intent in this architecture is to generate a short
summary of long textual content. The essence of the content is described in a concise
manner and all important information is represented.
Example scenarios
Zero-shot prompt engineering is used to summarize the bills. The prompt and settings
are then modified to generate different summary outputs.
Dataset
The first dataset is the BillSum dataset for summarization of US Congressional and
California state bills. This example uses only the Congressional bills. The data is split into
18,949 bills to use for training and 3,269 bills to use for testing. BillSum focuses on mid-
length legislation that's between 5,000 and 20,000 characters long. It's cleaned and
preprocessed.
For more information about the dataset and instructions for download, see FiscalNote /
BillSum .
BillSum schema
In this use case, the text and summary elements are used.
Zero-shot
The goal here is to teach the GPT-3 model to learn conversation-style input. The
completions endpoint is used to create an Azure OpenAI API and a prompt that
generates the best summary of the bill. It's important to create the prompts carefully so
that they extract relevant information. To extract general summaries from a given bill,
the following format is used.
Python
openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2022-12-01-preview"
prompt_i = 'Summarize the legislative bill given the title and the
text.\n\nTitle:\n'+" ".join([normalize_text(bill_title_1)])+ '\n\nText:\n'+
" ".join([normalize_text(bill_text_1)])+'\n\nSummary:\n'
response = openai.Completion.create(
engine=TEXT_DAVINCI_001
prompt=prompt_i,
temperature=0.4,
max_tokens=500,
top_p=1.0,
frequency_penalty=0.5,
presence_penalty=0.5,
stop=['\n\n###\n\n'], # The ending token used during inference. Once it
reaches this token, GPT-3 knows the completion is over.
best_of=1
)
= 1
Ground truth: National Science Education Tax Incentive for Businesses Act of 2007 -
Amends the Internal Revenue Code to allow a general business tax credit for
contributions of property or services to elementary and secondary schools and for
teacher training to promote instruction in science, technology, engineering, or
mathematics.
Zero-shot model summary: The National Science Education Tax Incentive for Businesses
Act of 2007 would create a new tax credit for businesses that make contributions to
science, technology, engineering, and mathematics (STEM) education at the elementary
and secondary school level. The credit would be equal to 100 percent of the qualified
STEM contributions of the taxpayer for the taxable year. Qualified STEM contributions
would include STEM school contributions, STEM teacher externship expenses, and STEM
teacher training expenses.
Fine-tuning
Fine-tuning improves upon zero-shot learning by training on more examples than you
can include in the prompt, so you achieve better results on a wider number of tasks.
After a model is fine-tuned, you don't need to provide examples in the prompt. Fine-
tuning saves money by reducing the number of tokens required and enables lower-
latency requests.
For more information, see How to customize a model with Azure OpenAI Service.
This step enables you to improve upon the zero-shot model by incorporating prompt
engineering into the prompts that are used for fine-tuning. Doing so helps give
directions to the model on how to approach the prompt/completion pairs. In a fine-tune
model, prompts provide a starting point that the model can learn from and use to make
predictions. This process enables the model to start with a basic understanding of the
data, which can then be improved upon gradually as the model is exposed to more data.
Additionally, prompts can help the model to identify patterns in the data that it might
otherwise miss.
The same prompt engineering structure is also used during inference, after the model is
finished training, so that the model recognizes the behavior that it learned during
training and can generate completions as instructed.
Python
return proc_df
df_staged_full_train = stage_examples(df_prompt_completion_train)
df_staged_full_val = stage_examples(df_prompt_completion_val)
Now that the data is staged for fine-tuning in the proper format, you can start running
the fine-tune commands.
Next, you can use the OpenAI CLI to help with some of the data preparation steps. The
OpenAI tool validates data, provides suggestions, and reformats data.
Python
Python
payload = {
"model": "curie",
"training_file": " -- INSERT TRAINING FILE ID -- ",
"validation_file": "-- INSERT VALIDATION FILE ID --",
"hyperparams": {
"n_epochs": 1,
"batch_size": 200,
"learning_rate_multiplier": 0.1,
"prompt_loss_weight": 0.0001
}
}
Python
data = r.json()
print('Endpoint Called: {endpoint}'.format(endpoint = url))
print('Status Code: {status}'.format(status= r.status_code))
print('Fine tuning ID: {id}'.format(id=fine_tune_id))
print('Status: {status}'.format(status = data['status']))
print('Response Information \n\n {text}'.format(text=r.text))
Ground truth: National Science Education Tax Incentive for Businesses Act of 2007 -
Amends the Internal Revenue Code to allow a general business tax credit for
contributions of property or services to elementary and secondary schools and for
teacher training to promote instruction in science, technology, engineering, or
mathematics.
Fine-tuned model summary: This bill provides a tax credit for contributions to
elementary and secondary schools that benefit science, technology, engineering, and
mathematics education. The credit is equal to 100% of qualified STEM contributions
made by taxpayers during the taxable year. Qualified STEM contributions include: (1)
STEM school contributions, (2) STEM teacher externship expenses, and (3) STEM teacher
training expenses. The bill also provides a tax credit for contributions to elementary and
secondary schools that benefit science, technology, engineering, or mathematics
education. The credit is equal to 100% of qualified STEM service contributions made by
taxpayers during the taxable year. Qualified STEM service contributions include: (1)
STEM service contributions paid or incurred during the taxable year for services
provided in the United States or on a military base outside the United States; and (2)
STEM inventory property contributed during the taxable year which is used by an
educational organization located in the United States or on a military base outside the
United States in providing education in grades K-12 in science, technology, engineering
or mathematics.
For the results of summarizing a few more bills by using the zero-shot and fine-tune
approaches, see Results for BillSum Dataset .
Observations: Overall, the fine-tuned model does an excellent job of summarizing the
bill. It captures domain-specific jargon and the key points that are represented but not
explained in the human-written ground truth. It differentiates itself from the zero-shot
model by providing a more detailed and comprehensive summary.
Fine-tuning is not applied in the financial use case because there's not enough data
available to complete that step.
Dataset
The dataset for this use case is technical and includes key quantitative metrics to assess
a company's performance.
indexed).
completion : The ground truth summary of the report.
In this use case, Rathbone's financial report , from the dataset, will be summarized.
Rathbone's is an individual investment and wealth management company for private
clients. The report highlights Rathbone's performance in 2020 and mentions
performance metrics like profit, FUMA, and income. The key information to summarize is
on page 1 of the PDF.
Python
openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2022-12-01-preview"
name = os.path.abspath(os.path.join(os.getcwd(), '---INSERT PATH OF LOCALLY
DOWNLOADED RATHBONES_2020_PRELIM_RESULTS---')).replace('\\', '/')
pages_to_summarize = [0]
# Using pdfminer.six to extract the text
# !pip install pdfminer.six
from pdfminer.high_level import extract_text
t = extract_text(name
, page_numbers=pages_to_summarize
)
print("Text extracted from " + name)
t
Zero-shot approach
When you use the zero-shot approach, you don't provide solved examples. You provide
only the command and the unsolved input. In this example, the Instruct model is used.
This model is specifically intended to take in an instruction and record an answer for it
without extra context, which is ideal for the zero-shot approach.
After you extract the text, you can use various prompts to see how they influence the
quality of the summary:
Python
#Using the text from the Rathbone's report, you can try different prompts to
see how they affect the summary
response = openai.Completion.create(
engine="davinci-instruct",
prompt=prompt_i,
temperature=0,
max_tokens=2048-int(len(prompt_i.split())*1.5),
top_p=1.0,
frequency_penalty=0.5,
presence_penalty=0.5,
best_of=1
)
print(response.choices[0].text)
>>>
- Funds under management and administration (FUMA) reached £54.7 billion at
31 December 2020, up 8.5% from £50.4 billion at 31 December 2019
- Operating income totalled £366.1 million, 5.2% ahead of the prior year
(2019: £348.1 million)
- Underlying1 profit before tax totalled £92.5 million, an increase of 4.3%
(2019: £88.7 million); underlying operating margin of 25.3% (2019: 25.5%)
# Different prompt
response = openai.Completion.create(
engine="davinci-instruct",
prompt=prompt_i,
temperature=0,
max_tokens=2048-int(len(prompt_i.split())*1.5),
top_p=1.0,
frequency_penalty=0.5,
presence_penalty=0.5,
best_of=1
)
print(response.choices[0].text)
>>>
- Funds under management and administration (FUMA) grew by 8.5% to reach
£54.7 billion at 31 December 2020
- Underlying profit before tax increased by 4.3% to £92.5 million,
delivering an underlying operating margin of 25.3%
- The board is announcing a final 2020 dividend of 47 pence per share, which
brings the total dividend to 72 pence per share, an increase of 2.9% over
2019
Challenges
As you can see, the model might produce metrics that aren't mentioned in the
original text.
Proposed solution: You can resolve this problem by changing the prompt.
The summary might focus on one section of the article and neglect other
important information.
Proposed solution: You can try a summary of summaries approach. Divide the
report into sections and create smaller summaries that you can then summarize to
create the output summary.
Python
# Body of function
text = extract_text(name
, page_numbers=pages_to_summarize
)
r = splitter(200, text)
tok_l = int(2000/len(r))
tok_l_w = num2words(tok_l)
res_lis = []
# Stage 1: Summaries
for i in range(len(r)):
prompt_i = f'Extract and summarize the key financial numbers and
percentages mentioned in the Text in less than {tok_l_w}
words.\n\nText:\n'+normalize_text(r[i])+'\n\nSummary in one paragraph:'
response = openai.Completion.create(
engine=TEXT_DAVINCI_001,
prompt=prompt_i,
temperature=0,
max_tokens=tok_l,
top_p=1.0,
frequency_penalty=0.5,
presence_penalty=0.5,
best_of=1
)
t = trim_incomplete(response.choices[0].text)
res_lis.append(t)
response = openai.Completion.create(
engine=TEXT_DAVINCI_001,
prompt=prompt_i,
temperature=0,
max_tokens=200,
top_p=1.0,
frequency_penalty=0.5,
presence_penalty=0.5,
best_of=1
)
print(trim_incomplete(response.choices[0].text))
The input prompt includes the original text from Rathbone's financial report for a
specific year.
Ground truth: Rathbones has reported revenue of £366.1m in 2020, up from £348.1m in
2019, and an increase in underlying profit before tax to £92.5m from £88.7m. Assets
under management rose 8.5% from £50.4bn to £54.7bn, with assets in wealth
management increasing 4.4% to £44.9bn. Net inflows were £2.1bn in 2020 compared
with £600m in the previous year, driven primarily by £1.5bn inflows into its funds
business and £400m due to the transfer of assets from Barclays Wealth.
Observations: The summary of summaries approach generates a great result set that
resolves the challenges encountered initially when a more detailed and comprehensive
summary was provided. It does a great job of capturing the domain-specific jargon and
the key points, which are represented in the ground truth but not explained well.
The zero-shot model works well for summarizing mainstream documents. If the data is
industry-specific or topic-specific, contains industry-specific jargon, or requires industry-
specific knowledge, fine-tuning performs best. For example, this approach works well for
medical journals, legal forms, and financial statements. You can use the few-shot
approach instead of zero-shot to provide the model with examples of how to formulate
a summary, so it can learn to mimic the summary provided. For the zero-shot approach,
this solution doesn't retrain the model. The model's knowledge is based on the GPT-3
training. GPT-3 is trained with almost all available data from the internet. It performs
well for tasks that don't require specific knowledge.
For the results of using the zero-shot summary of summaries approach on a few reports
in the financial dataset, see Results for Summary of Summaries .
Recommendations
There are many ways to approach summarization by using GPT-3, including zero-shot,
few-shot, and fine-tuning. The approaches produce summaries of varying quality. You
can explore which approach produces the best results for your intended use case.
Based on observations on the testing presented in this article, here are few
recommendations:
Zero-shot is best for mainstream documents that don't require specific domain
knowledge. This approach attempts to capture all high-level information in a
succinct, human-like manner and provides a high-quality baseline summary. Zero-
shot creates a high-quality summary for the legal dataset that's used in the tests in
this article.
Few-shot is difficult to use for summarizing long documents because the token
limitation is exceeded when an example text is provided. You can instead use a
zero-shot summary of summaries approach for long documents or increase the
dataset to enable successful fine-tuning. The summary of summaries approach
generates excellent results for the financial dataset that's used in these tests.
Fine-tuning is most useful for technical or domain-specific use cases when the
information isn't readily available. To achieve the best results with this approach,
you need a dataset that contains a couple thousand samples. Fine-tuning captures
the summary in a few templated ways, trying to conform to how the dataset
presents the summaries. For the legal dataset, this approach generates a higher
quality of summary than the one created by the zero-shot approach.
Evaluating summarization
There are multiple techniques for evaluating the performance of summarization models.
Here's an example:
Python
Here's an example:
Python
import torchmetrics
from torchmetrics.text.bert import BERTScore
preds = "You should have ice cream in the summer"
target = "Ice creams are great when the weather is hot"
bertscore = BERTScore()
score = bertscore(preds, target)
print(score)
Similarity matrix. A similarity matrix is a representation of the similarities between
different entities in a summarization evaluation. You can use it to compare different
summaries of the same text and measure their similarity. It's represented by a two-
dimensional grid, where each cell contains a measure of the similarity between two
summaries. You can measure the similarity by using various methods, like cosine
similarity, Jaccard similarity, and edit distance. You then use the matrix to compare the
summaries and determine which one is the most accurate representation of the original
text.
Here's a sample command that gets the similarity matrix of a BERTScore comparison of
two similar sentences:
Python
The first sentence, "The cat is on the porch by the tree", is referred to as the candidate.
The second sentence is referred to as the reference. The command uses BERTScore to
compare the sentences and generate a matrix.
This following matrix displays the output that's generated by the preceding command:
For more information, see SummEval: Reevaluating Summarization Evaluation . For a
PyPI toolkit for summarization, see summ-eval 0.892 .
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Other contributors:
Next steps
Azure OpenAI - Documentation, quickstarts, API reference
What are intents in LUIS?
Conversational language understanding
Jupyter Notebook with technical details and execution of this use case
Related resources
AI architecture design
Choose a Microsoft cognitive services technology
Natural language processing technology
Conversation summarization
Azure AI services
Most businesses provide customer service support to help customers with product
queries, troubleshooting, and maintaining or upgrading features or the product itself. To
provide a satisfactory resolution, customer support specialists need to respond quickly
with accurate information. OpenAI can help organizations with customer support in a
variety of ways.
Conversation scenarios
Self-service chatbots (fully automated). In this scenario, customers can interact
with a chatbot that's powered by GPT-3 and trained on industry-specific data. The
chatbot can understand customer questions and answer appropriately based on
responses learned from a knowledge base.
Chatbot with agent intervention (semi-automated). Questions posed by
customers are sometimes complex and necessitate human intervention. In such
cases, GPT-3 can provide a summary of the customer-chatbot conversation and
help the agent with quick searches for additional information from a large
knowledge base.
Summarizing transcripts (fully automated or semi-automated). In most customer
support centers, agents are required to summarize conversations for record
keeping, future follow-up, training, and other internal processes. GPT-3 can
provide automated or semi-automated summaries that capture salient details of
conversations for further use.
This guide focuses on the process for summarizing transcripts by using Azure OpenAI
GPT-3.
Architecture
A typical architecture for a conversation summarizer has three main stages: pre-
processing, summarization, and post-processing. If the input contains a verbal
conversation or any form of speech, the speech needs to be transcribed to text. For
more information, see Azure Speech-to-text service .
Workflow
1. Gather input data: Feed relevant input data into the pipeline. If the source is an
audio file, you need to convert it to text by using a TTS service like Azure text-to-
speech.
2. Pre-process the data: Remove confidential information and any unimportant
conversation from the data.
3. Feed the data into the summarizer: Pass the data in a prompt via Azure OpenAI
APIs. In-context learning models include zero-shot, few-shot, or a custom model.
4. Generate a summary: The model generates a summary of the conversation.
5. Post-process the data: Apply a profanity filter and various validation checks to the
summary. Add sensitive or confidential data that was removed during the pre-
process step back into the summary.
6. Evaluate the results: Review and evaluate the results. This step can help you
identify areas where the model needs to be improved and find errors.
The following sections provide more details about the three main stages.
Pre-process
The goal of pre-processing is to ensure that the data provided to the summarizer service
is relevant and doesn't include sensitive or confidential information.
Here are some pre-processing steps that can help condition your raw data. You might
need to apply one or many steps, depending on the use case.
Remove personally identifiable information (PII). You can use the Conversational
PII API (preview) to remove PII from transcribed or written text. This example shows
the output after the API has removed PII:
Summarizer
OpenAI's text-completion API endpoint is called the completions endpoint. To start the
text-completion process, it requires a prompt. Prompt engineering is a process used in
large language models. The first part of the prompt includes natural language
instructions and/or examples of the specific task requested (in this scenario,
summarization). Prompts allow developers to provide some context to the API, which
can help it generate more relevant and accurate text completions. The model then
completes the task by predicting the most probable next text. This technique is known
as in-context learning.
7 Note
There are three main approaches for training models for in-context learning: zero-shot,
few-shot and fine-tuning. These approaches vary based on the amount of task-specific
data that's provided to the model.
Zero-shot: In this approach, no examples are provided to the model. The task
request is the only input. In zero-shot learning, the model relies on data that GPT-3
is already trained on (almost all available data from the internet). It attempts to
relate the given task to existing categories that it has already learned about and
responds accordingly.
Few-shot: When you use this approach, you include a small number of examples in
the prompt that demonstrate the expected answer format and the context. The
model is provided with a very small amount of training data, typically just a few
examples, to guide its predictions. Training with a small set of examples enables
the model to generalize and understand related but previously unseen tasks.
Creating these few-shot examples can be challenging because they need to clarify
the task you want the model to perform. One commonly observed problem is that
models, especially small ones, are sensitive to the writing style that's used in the
training examples.
The main advantages of this approach are a significant reduction in the need for
task-specific data and reduced potential to learn an excessively narrow distribution
from a large but narrow fine-tuning dataset.
With this approach, you can't update the weights of the pretrained model.
You can use this customization step to improve your process by:
Including a larger set of example data.
Using traditional optimization techniques with backpropagation to readjust the
weights of the model. These techniques enable higher quality results than the
zero-shot or few-shot approaches provide by themselves.
Improving the few-shot learning approach by training the model weights with
specific prompts and a specific structure. This technique enables you to achieve
better results on a wider number of tasks without needing to provide examples
in the prompt. The result is less text sent and fewer tokens.
Disadvantages include the need for a large new dataset for every task, the
potential for poor generalization out of distribution, and the possibility to exploit
spurious features of the training data, resulting in high chances of unfair
comparison with human performance.
Creating a dataset for model customization is different from designing prompts for
use with the other models. Prompts for completion calls often use either detailed
instructions or few-shot learning techniques and consist of multiple examples. For
fine-tuning, we recommend that each training example consists of a single input
example and its desired output. You don't need to provide detailed instructions or
examples in the prompt.
Post-process
We recommend that you check the validity of the results that you get from GPT-3.
Implement validity checks by using a programmatic approach or classifiers, depending
on the use case. Here are some critical checks:
Finally, reintroduce any vital information that was previously removed from the
summary, like confidential information.
In some cases, a summary of the conversation is also sent to the customer, along with
the original transcript. In these cases, post-processing involves appending the transcript
to the summary. It can also include adding lead-in sentences like "Please see the
summary below."
Considerations
It's important to fine-tune your base models with an industry-specific training dataset
and change the size of available datasets. Fine-tuned models perform best when the
training data includes at least 1,000 data points and the ground truth (human-generated
summaries) used to train the models is of high quality.
The tradeoff is cost. The process of labeling and cleaning datasets can be expensive. To
ensure high-quality training data, you might need to manually inspect ground truth
summaries and rewrite low-quality summaries. Consider the following points about the
summarization stage:
Prompt engineering: When provided with little instruction, Davinci often performs
better than other models. To optimize results, experiment with different prompts
for different models.
Token size: A summarizer that's based on GPT-3 is limited to a total of 4,098
tokens, including the prompt and completion. To summarize larger passages,
separate the text into parts that conform to these constraints. Summarize each part
individually and then collect the results in a final summary.
Garbage in, garbage out: Trained models are only as good as the training data that
you provide. Be sure that the ground truth summaries in the training data are well
suited to the information that you eventually want to summarize in your dialogs.
Stopping point: The model stops summarizing when it reaches a natural stopping
point or a stop sequence that you provide. Test this parameter to choose among
multiple summaries and to check whether summaries look incomplete.
Example scenario: Summarizing transcripts in
call centers
This scenario demonstrates how the Azure OpenAI summarization feature can help
customer service agents with summarization tasks. It tests the zero-shot, few-shot, and
fine-tuning approaches and compares the results against human-generated summaries.
Prompt Completion
Customer: Sweet.
Ideal output. The goal is to create summaries that follow this format: "Customer said x.
Agent responded y." Another goal is to capture salient features of the dialog, like the
customer complaint, suggested resolution, and follow-up actions.
Here's an example of a customer support interaction, followed by a comprehensive
human-written summary of it:
Dialog
Agent. I see that you need help with the Xbox Game Pass.
Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.
Agent: Once a game leaves the Xbox Game Pass catalog, you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off (or the best available
discounted price) to continue playing a game once it leaves the catalog.
Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.
Zero-shot
The zero-shot approach is useful when you don't have ample labeled training data. In
this case, there aren't enough ground truth summaries. It's important to design prompts
carefully to extract relevant information. The following format is used to extract general
summaries from customer-agent chats:
Python
rouge = Rouge()
# Run zero-shot prediction for all engines of interest
deploymentNames = ["curie-instruct","davinci-instruct"] # also known as
text-davinci/text-instruct
for deployment in deploymentNames:
url = openai.api_base + "openai/deployments/" + deployment + "/completions?
api-version=2022-12-01-preivew"
response_list = []
rouge_list = []
print("calling…" + deployment)
for i in range(len(test)):
response_i = openai.Completion.create(
engine = deployment,
prompt = build_prompt(prefix, [test['prompt'][i]], suffix),
temperature = 0.0,
max_tokens = 400,
top_p = 1.0,
frequence_penalty = 0.5,
persence_penalty = 0.0,
stop=["end"] # We recommend that you adjust the stop sequence based on the
dataset
)
scores = rouge.get_scores(normalize_text(response_i[‘choices’][ 0]
['text']),test['completion'][i])
rouge_list += [scores[0]['rouge-1']['f']],
response_list += [response_i]
summary_list = [normalize_text(i[‘choices’][0][‘text’]) for i in
response_list]
test[deployment + "_zeroshotsummary"] = summary_list
test[deployment + "_zeroshotroguescore"] = rouge_list
Dialog
Agent: I see that you need help with the Xbox Game Pass.
Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.
Agent: Once a game leaves the Xbox game pass catalog you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off or the best available
discounted price to continue playing a game once it leaves the catalog.
Ground truth
Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.
Davinci result
The customer asked how long they could access games after they leave the Xbox game
pass catalog. The agent told them that they would need to purchase a digital copy of
the game from the Xbox app for windows or the Microsoft store in order to continue
playing it. The agent also reminded the customer that they would be notified prior to a
game leaving the Xbox game pass catalog. as a member the customer could purchase
any game in the catalog for up to 20 off or the best available discounted price.
Curie result
The customer asks the agent about the Xbox game pass. the agent tells the customer
that once a game leaves the Xbox game pass catalog the customer will need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game. The agent
also reminds the customer that Xbox will notify members prior to a game leaving the
Xbox game pass catalog.
Few-shot
When you use the few-shot approach, the model is provided with a small number of
examples.
summaries:"
Python
train_small = train[]
train_small_json = train_small.to_dict(orient='records')
compiled_train_prompt = build_prompt_fewshot(prefix,context_primer,
train_small_json, suffix)
Dialog
Agent: I see that you need help with the Xbox Game Pass.
Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.
Agent: Once a game leaves the Xbox Game Pass catalog, you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off or the best available
discounted price to continue playing a game once it leaves the catalog.
Ground truth
Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.
Davinci result
customer wanted to know how long they could access games after they leave game
pass. Agent informs that once a game leaves the Xbox game pass catalog the customer
would need to purchase a digital copy or obtain another form of entitlement to
continue playing the game.
Curie result
customer has a question about the game pass. customer is good. agent needs help with
the Xbox game pass. customer asks how long they can access the games after they
leave the game pass catalog. Agent informs that once a game leaves the Xbox game
pass catalog the customer will need to purchase a digital copy from the Xbox app for
windows or the Microsoft store play from a disc or obtain another form of entitlement
to continue playing the game. customer is happy to hear this and thanks agent.
Fine-tuning
Fine-tuning is the process of tailoring models to get a specific desired outcome from
your own datasets.
Dialog
Agent: I see that you need help with the Xbox Game Pass.
Customer: Yes. I wanted to know how long can I access the games after they leave game
pass.
Agent: Once a game leaves the Xbox Game Pass catalog, you'll need to purchase a
digital copy from the Xbox app for Windows or the Microsoft Store, play from a disc, or
obtain another form of entitlement to continue playing the game. Remember, Xbox will
notify members prior to a game leaving the Xbox Game Pass catalog. And, as a member,
you can purchase any game in the catalog for up to 20% off or the best available
discounted price to continue playing a game once it leaves the catalog.
Ground truth
Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.
Curie result
customer wants to know how long they can access the games after they leave game
pass. agent explains that once a game leaves the Xbox game pass catalog they'll need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game.
Conclusions
Generally, the Davinci model requires fewer instructions to perform tasks than other
models, such as Curie. Davinci is better suited for summarizing text that requires an
understanding of context or specific language. Because Davinci is the most complex
model, its latency is higher than that of other models. Curie is faster than Davinci and is
capable of summarizing conversations.
These tests suggest that you can generate better summaries when you provide more
instruction to the model via few-shot or fine-tuning. Fine-tuned models are better at
conforming to the structure and context learned from a training dataset. This capability
is especially useful when summaries are domain specific (for example, generating
summaries from a doctor's notes or online-prescription customer support). If you use
fine-tuning, you have more control over the types of summaries that you see.
For the sake of easy comparison, here's a summary of the results that are presented
earlier:
Ground truth
Customer wants to know how long they can access games after they have left Game
Pass. Agent informs customer that they would need to purchase the game to continue
having access.
The customer asked how long they could access games after they leave the Xbox game
pass catalog. The agent told them that they would need to purchase a digital copy of
the game from the Xbox app for windows or the Microsoft store in order to continue
playing it. The agent also reminded the customer that they would be notified prior to a
game leaving the Xbox game pass catalog. As a member the customer could purchase
any game in the catalog for up to 20 off or the best available discounted price.
The customer asks the agent about the Xbox game pass. the agent tells the customer
that once a game leaves the Xbox game pass catalog the customer will need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game. The agent
also reminds the customer that Xbox will notify members prior to a game leaving the
Xbox game pass catalog.
customer has a question about the game pass. customer is good. agent needs help with
the Xbox game pass. customer asks how long they can access the games after they
leave the game pass catalog. Agent informs that once a game leaves the Xbox game
pass catalog the customer will need to purchase a digital copy from the Xbox app for
windows or the Microsoft store play from a disc or obtain another form of entitlement
to continue playing the game. customer is happy to hear this and thanks agent.
customer wants to know how long they can access the games after they leave game
pass. agent explains that once a game leaves the Xbox game pass catalog they'll need to
purchase a digital copy from the Xbox app for windows or the Microsoft store play from
a disc or obtain another form of entitlement to continue playing the game.
Evaluating summarization
There are multiple techniques for evaluating the performance of summarization models.
Here's an example:
Python
Here's an example:
Python
import torchmetrics
from torchmetrics.text.bert import BERTScore
preds = "You should have ice cream in the summer"
target = "Ice creams are great when the weather is hot"
bertscore = BERTScore()
score = bertscore(preds, target)
print(score)
Python
The first sentence, "The cat is on the porch by the tree," is referred to as the candidate.
The second sentence is referred as the reference. The command uses BERTScore to
compare the sentences and generate a matrix.
This following matrix displays the output that's generated by the preceding command:
Responsible use
GPT can produce excellent results, but you need to check the output for social, ethical,
and legal biases and harmful results. When you fine-tune models, you need to remove
any data points that might be harmful for the model to learn. You can use red teaming
to identify any harmful outputs from the model. You can implement this process
manually and support it by using semi-automated methods. You can generate test cases
by using language models and then use a classifier to detect harmful behavior in the
test cases. Finally, you should perform a manual check of generated summaries to
ensure that they're ready to be used.
For more information, see Red Teaming Language Models with Language Models .
Contributors
This article is maintained by Microsoft. It was originally written by the following
contributors.
Principal author:
Other contributor:
Next steps
More information about Azure OpenAI
ROUGE reference article
Training module: Introduction to Azure OpenAI Service
Learning path: Develop AI solutions with Azure OpenAI
Related resources
Query-based document summarization
Choose a Microsoft cognitive services technology
Natural language processing technology