NIH Cloud Lab for Azure

NIH Cloud Lab’s goal is to make Cloud easy and accessible for you, so that you can spend less time on administrative tasks and focus more on research.

Use this repository to learn how to use Azure by exploring the linked resources and walking through the tutorials. If you are a beginner, we suggest you start with this jumpstart section. If you already have foundational knowledge of Azure and the cloud, feel free to skip ahead to the tutorials repository for in-depth examples of how to run specific workflows such as genomic variant calling and medical image analysis.

Getting Started

The Azure Getting Started Tutorials Page covers much of what is possible on Azure, and we recommend exploring some of the tutorials there. Nonetheless, it can be hard to know where to start if you are new to the cloud. To help you, we thought through some of the most common tasks you will encounter doing cloud-enabled research, and gathered tutorials and guides specific to those topics. We hope the following materials are helpful as you explore using Azure!

Overview

There are three primary ways you can run analyses using Azure: Virtual Machines, Jupyter Notebook instances, and managed (serverless) services. We give a brief overview of each of these here and go into more detail in the sections below. Virtual Machines, or VMs, are like desktop computers, but you access them through the cloud console and you get to pick the operating system and the specifications such as CPU and memory. Jupyter Notebook instances are virtual machines with a preconfigured JupyterLab. On Azure these are run through Azure Machine Learning Studio, which is also Azure’s AI/ML platform. You decide what kind of virtual machine you want to ‘spin up’ and then you can run Jupyter notebooks on those virtual machines using JupyterLab or VS Code. Finally, managed (serverless) services let you run an analysis, an app, or a website without having to deal with your own servers (VMs). Azure Batch is a common example.

Resource Groups

A resource group is a container that holds related resources for an Azure solution. The resource group can include all the resources for the solution, or only those resources that you want to manage as a group. You decide how you want to allocate resources to resource groups based on what makes the most sense for your use case. Generally, add resources that share the same lifecycle to the same resource group so you can easily deploy, update, and delete them as a group. Each resource group stores metadata about the underlying resources. Therefore, when you specify a location for the resource group, you are specifying where that metadata is stored. To learn more, view our docs on Managing Resource Groups.

Command Line Tools

Most tasks in Azure can be done without the command line, but the command line tools will generally make your life easier in the long run. Command line interface (CLI) tools are those that you use directly in a terminal/shell as opposed to clicking within the Azure portal’s graphical user interface (GUI). The primary tool you will need is the Azure CLI, which allows you to interact with Virtual Machines (VMs) or Storage Accounts (see below) from your local terminal. Instructions for the CLI can be found here. If you are unable to install the CLI locally, you can run all the CLI commands from within a VM or a Machine Learning compute instance, or from the Cloud Shell.
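As a rough illustration of scripting the Azure CLI, the Python sketch below builds the argument list for `az group create`, which creates a resource group like those described in the previous section. The resource group name `cloudlab-demo-rg` and the region `eastus` are placeholder assumptions, and the helper functions are hypothetical names for this example only. The command is only printed here; nothing is run against your account unless you call `run_if_available` yourself.

```python
import shutil
import subprocess


def az_group_create_args(name: str, location: str) -> list:
    """Build the argument list for `az group create` (nothing is executed here)."""
    return [
        "az", "group", "create",
        "--name", name,          # resource group name (placeholder in this sketch)
        "--location", location,  # region where the group's metadata is stored
        "--output", "json",
    ]


def run_if_available(args: list):
    """Run the command only if the `az` executable is on PATH; otherwise return None."""
    exe = shutil.which(args[0])
    if exe is None:
        return None
    return subprocess.run([exe, *args[1:]], capture_output=True, text=True, check=False)


# Placeholder values; replace with your own group name and region.
args = az_group_create_args("cloudlab-demo-rg", "eastus")
print(" ".join(args))
```

Calling `run_if_available(args)` would actually create the resource group, and it assumes you have already signed in with `az login`.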

To install and configure the Azure CLI, see Get started with Azure CLI, which provides detailed installation instructions as well as documentation on common Azure CLI commands.

Microsoft Azure also has a cloud-native service called Microsoft Genomics, which offers a cloud implementation of the Burrows-Wheeler Aligner (BWA) and the Genome Analysis Toolkit (GATK) for secondary analysis. Find documentation on how to use Microsoft Genomics here.

Azure Marketplace

The Microsoft Azure Marketplace is an online store in Azure that contains thousands of software applications and services. For example, you can find VMs configured for Microsoft Genomics or NVIDIA machine learning. Within Cloud Lab, the most common use case for the Marketplace will likely be CycleCloud, which is Azure’s High Performance Computing solution. If you are interested in CycleCloud, please contact us at CloudLab@nih.gov so we can help set this up in your Cloud Lab account.

Ingest and Store Data using Azure Storage Accounts

Microsoft’s object storage solution for the cloud is called Azure Blob Storage. Blob Storage is optimized for storing massive amounts of unstructured data. Azure also offers many other storage solutions, listed here. To get started you must create a Storage Account. You can grant limited access to Azure storage resources using Shared Access Signatures, or SAS. You can also read our guide to Storage Accounts and moving data in and out of Cloud Lab here. This Microsoft guide for moving genomic data is also very helpful.
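To make the relationship between a Storage Account, a container, a blob, and a SAS token more concrete, here is a minimal Python sketch that assembles a blob URL. It assumes the public Azure cloud endpoint pattern `https://<account>.blob.core.windows.net/<container>/<blob>`; other Azure environments use different endpoint suffixes. The account name, container name, and token in the usage lines are placeholders, not real values, and `blob_url` is a hypothetical helper for this example.

```python
from urllib.parse import quote


def blob_url(account: str, container: str, blob_name: str, sas_token: str = "") -> str:
    """Compose a blob URL, optionally appending a SAS token as the query string."""
    # quote() percent-encodes spaces and similar characters but keeps "/" so that
    # virtual folder paths inside the blob name are preserved.
    base = f"https://{account}.blob.core.windows.net/{container}/{quote(blob_name)}"
    token = sas_token.lstrip("?")  # accept tokens with or without a leading "?"
    return f"{base}?{token}" if token else base


# Placeholder account, container, and token for illustration only.
print(blob_url("mystorageacct", "raw-data", "run1/sample 1.fastq.gz"))
print(blob_url("mystorageacct", "raw-data", "run1/sample 1.fastq.gz", "?sv=PLACEHOLDER&sig=PLACEHOLDER"))
```

Anyone holding a URL with a valid SAS token can access the blob within the limits of that token, so treat such URLs like credentials.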

Virtual Machines

Virtual machines (VMs) on Azure can be accessed via SSH or from the Azure portal. More information on VMs can be found here, and this guide explains how to use SSH keys with Windows on Azure. To view the different types of VMs available in Azure, check out the Virtual Machine Series.

You can also spin up preconfigured VMs, such as the Azure Data Science VM, which has many data science tools preinstalled and may save you time on environment setup. Read more in our docs. For more on VM best practices, review this guide.

Azure Functions

Azure Functions is a serverless solution that allows you to write less code, maintain less infrastructure, and save on costs. Instead of worrying about deploying and maintaining servers, the cloud infrastructure provides all the up-to-date resources needed to keep your applications running. For more information click here. In general, consider Azure Functions for automating workflows.

Disk Images

Part of the power of virtual machines is that they offer a blank slate for you to configure as desired. Azure VM Image Builder simplifies the image building process and lets you save custom-built images. You can later redeploy these images to spin up a new machine with data or environments already installed.

Launch a Machine Learning Workspace

Azure Machine Learning Studio is Azure’s ML/AI solution. ML Studio allows you to run your own code in managed Jupyter notebooks. Follow the Quickstart page to begin running Jupyter Notebooks in Studio. Note that the compute instance is separate from your notebooks, so you can choose to work on the same compute in either a Jupyter or a VS Code environment.

The Azure file share of your Azure Machine Learning workspace is mounted as a drive on the compute instance. This drive is the default working directory for Jupyter, JupyterLab, RStudio, and Posit Workbench. This means that the notebooks and other files you create in Jupyter, JupyterLab, RStudio, or Posit Workbench are automatically stored on the file share and are available to use in other compute instances as well.

Clusters

One great thing about the cloud is its ability to scale with demand. When you submit a job to a traditional cluster, you specify up front how many CPUs and how much memory you want to give to your job, and you may over- or under-utilize these resources. With managed resources like serverless services and clusters you can leverage a feature called autoscaling, where the compute resources scale up or down with demand. This is more efficient: it keeps costs down when demand is low and prevents latency when demand is high (think about workshop participants all submitting jobs to a cluster at the same time). For most users of Cloud Lab, the best way to leverage scaling is to use Azure Batch, but in some cases it may make sense to spin up a Kubernetes cluster. If you are interested in using a scheduler like SLURM or Sun Grid Engine, you can use Azure CycleCloud, which has an easy-to-use GUI as well as CLI options. If you are interested in CycleCloud, please contact us at CloudLab@nih.gov and we will provision a CycleCloud instance for you.
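The idea behind autoscaling can be sketched in a few lines of Python. This is an illustrative rule only, not the formula language or behavior of Azure Batch, Kubernetes, or CycleCloud: it assumes a hypothetical pool where each node handles a fixed number of tasks at once, and it clamps the node count between a minimum and a maximum that you choose.

```python
import math


def target_node_count(pending_tasks: int, tasks_per_node: int,
                      min_nodes: int = 0, max_nodes: int = 10) -> int:
    """Return how many nodes a simple autoscaler would request for the current queue."""
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    # Enough nodes to run every pending task at once...
    needed = math.ceil(pending_tasks / tasks_per_node)
    # ...but never below the floor or above the ceiling set for the pool.
    return max(min_nodes, min(max_nodes, needed))


# Hypothetical queue sizes: idle, moderate load, and a workshop-style burst.
for queued in (0, 9, 100):
    print(queued, "tasks ->", target_node_count(queued, tasks_per_node=4), "nodes")
```

Setting `min_nodes` to 0 lets the pool shrink to nothing when idle, which is what keeps costs down, while `max_nodes` caps spending during bursts.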

Managing Containers with Azure Container Registry

You can host or pull containers with Azure Container Registry. See Microsoft’s documentation on how to use this service.

Billing and Benchmarking

Many Cloud Lab users are interested in understanding how to estimate the price of a large-scale project using a reduced sample size. Generally, you should be able to benchmark with a few representative samples to get an idea of time and cost required for a larger scale project. Follow our Cost Management Guide to see how to tag specific resources for workflow benchmarking.
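The extrapolation from a small benchmark to a full project can be sketched as below. This is a back-of-the-envelope estimate that assumes runtime scales linearly with the number of samples and that compute is the dominant cost; storage, data transfer, and managed-service fees are ignored. The hourly rate and sample counts are made-up placeholders, not Azure prices, and `estimate_project_cost` is a hypothetical helper for this example.

```python
def estimate_project_cost(benchmark_samples: int, benchmark_hours: float,
                          hourly_rate_usd: float, total_samples: int):
    """Linearly extrapolate compute hours and cost from a small benchmark run."""
    if benchmark_samples <= 0:
        raise ValueError("benchmark_samples must be positive")
    hours_per_sample = benchmark_hours / benchmark_samples
    total_hours = hours_per_sample * total_samples
    return total_hours, total_hours * hourly_rate_usd


# Placeholder numbers: 5 samples took 2.5 VM-hours on a VM billed at $0.50/hour,
# and the full project has 1000 samples.
hours, cost = estimate_project_cost(5, 2.5, 0.50, 1000)
print(f"Estimated compute: {hours:.0f} hours, about ${cost:.2f}")
```

Treat the result as a starting point and compare it against what your tagged benchmark resources actually cost.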

The best way to estimate costs is to use the Azure pricing calculator here, which forecasts costs based on products and usage. Then, you can run some benchmarks and double-check that the actual costs match what you expect. See our docs for best practices when using this tool.

Cost Optimization

Follow our Cost Management Guide for details on how to monitor costs, set up budget alerts, and cost-benchmark specific analyses using resource tagging. In addition, here are a few tips to help you stay on budget. You can also configure auto-shutdown on your VM instances following this guide to prevent you from accidentally leaving instances running.
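As a small illustration of how budget alerts behave, the sketch below reports which alert thresholds a given spend has crossed. The thresholds (50%, 80%, and 100% of budget) and the dollar amounts are assumptions chosen for the example, and the function is a hypothetical helper, not part of any Azure tool.

```python
def crossed_thresholds(spend_usd: float, budget_usd: float,
                       thresholds=(0.5, 0.8, 1.0)) -> list:
    """Return the alert thresholds (as fractions of budget) that spend has reached."""
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    fraction_used = spend_usd / budget_usd
    return [t for t in sorted(thresholds) if fraction_used >= t]


# Placeholder figures: $420 spent against a $500 budget.
print(crossed_thresholds(420, 500))
```

With these placeholder figures, 84% of the budget is used, so the 50% and 80% alerts would have fired but the 100% alert would not.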

Getting Support

As part of your participation in Cloud Lab you will be added to the Cloud Lab Teams channel, where you can chat with other Cloud Lab users and get support from the Cloud Lab team. For issues related to the cloud environment, feel free to request Azure Enterprise Support. For all other questions, you can reach out to the Cloud Lab email at CloudLab@nih.gov with the subject line “Cloud lab Support Request”, or open a GitHub Issue.

Additional Training

This repo only scratches the surface of what can be done in the cloud. If you are interested in additional cloud training opportunities, please visit the STRIDES Training page. For more information on the STRIDES Initiative at the NIH, visit our website or contact the NIH STRIDES team at STRIDES@nih.gov.