NIH Cloud Lab for AWS

The sheer quantity of resources available to learn AWS can quickly become overwhelming. NIH Cloud Lab's goal is to make the cloud easy and accessible for you, so that you can spend less time on administrative tasks and more time on your research.

Use this page to learn how to use AWS for research by exploring the linked resources and walking through the tutorials. If you are a beginner, we suggest you begin with the jumpstart section below. If you already have foundational knowledge of AWS and the cloud, feel free to skip ahead to the tutorials repository for in-depth examples of how to run specific workflows such as generative AI, genomic variant calling, and medical image analysis.

Getting Started

You can learn a lot about what is possible on AWS from the AWS Getting Started Tutorials Page, and we recommend you go there and explore some of the tutorials on offer. Nonetheless, it can be hard to know where to start if you are new to the cloud. To help, we thought through some of the most common tasks you will encounter doing cloud-enabled research and gathered tutorials and guides specific to those topics. We hope the following materials are helpful as you get started.

Overview

There are three primary ways you can run analyses on AWS: virtual machines, Jupyter Notebook instances, and serverless managed services. We give a brief overview of each here and go into more detail in the sections below. Virtual machines are like desktop computers, but you access them through the cloud console and you get to pick the operating system and the specs such as CPU and memory. On AWS, these virtual machines are called Elastic Compute Cloud, or EC2 for short. Jupyter Notebook instances are virtual machines preconfigured with JupyterLab. On AWS these run through SageMaker, which is also AWS's ML/AI platform; you decide what kind of virtual machine you want to 'spin up' and then run Jupyter notebooks on it. Finally, serverless managed services let you run something (an analysis, an app, a website) without having to deal with your own servers (VMs). There are still servers running somewhere, you just don't have to manage them. All you have to do is call a command that runs your analysis in the background and then collect the outputs, usually from a storage bucket.

Identity and Access Management (IAM)

Identity and Access Management (IAM) is the service that controls your roles and access to all of AWS. Check out the AWS Getting Started Page for more details. In Cloud Lab you do not have full access to IAM, but you can create Roles and attach Permissions to those Roles. For example, you may need to grant your SageMaker Role additional permissions to access some AWS services. To do this, navigate to IAM, then Roles. Search for SageMaker and select AmazonSageMaker-ExecutionRoleXYZ, where XYZ is your Role's unique identifier. Next, go to Add Permissions, where you can attach policies as needed. See an example here.
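
If you prefer the command line, the same Role update can be scripted with the AWS CLI. Here is a minimal sketch; the role name suffix and the AmazonS3ReadOnlyAccess policy are placeholders, so substitute your own Role identifier and whichever policy your workflow actually needs.

```bash
# Find the full name of your SageMaker execution role (the XYZ suffix varies)
aws iam list-roles \
  --query "Roles[?contains(RoleName, 'AmazonSageMaker-ExecutionRole')].RoleName"

# Attach a managed policy to that role (example policy; pick the one you need)
aws iam attach-role-policy \
  --role-name AmazonSageMaker-ExecutionRole-XYZ \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
```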

Command Line Tools

Most tasks in AWS can be done without the command line, but the command line tools will generally make your life easier in the long run. Command line interface (CLI) tools are those that you use directly in a terminal/shell as opposed to clicking within a graphical user interface (GUI). The primary tool you will need is the AWS CLI, which will allow you to interact with instances or S3 buckets (see below) from your local terminal. Instructions for the CLI can be found here. If you are unable to install locally, you can use all the CLI commands from within EC2 and SageMaker instances, or from the Cloud Shell.
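
As a quick orientation, the commands below show the general pattern of CLI usage once the tool is installed; what you see in your account will of course differ.

```bash
# One-time setup: enter your access keys, default region, and output format
aws configure

# List your S3 buckets and describe the EC2 instances in your account
aws s3 ls
aws ec2 describe-instances
```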

To configure the CLI, you will need to authenticate using access keys, which are unique strings that tell AWS that you are allowed to interact with the account. Within Cloud Lab, you will need to use Short Term Access Keys (STAKs). If you are an NIH user, follow these instructions. Short Term Access Keys differ from Long Term Access Keys in that they only work for a short period of time; once your time limit expires, you have to request new keys and then authenticate again. If you do not work at NIH but have a Cloud Lab account, you will not have access to STAKs and will need to use the AWS CLI from within AWS (such as within a SageMaker notebook, an EC2 instance, or the Cloud Shell). If you have issues with a tutorial in this repository, just email us at CloudLab@nih.gov.
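
When you receive short term keys, one common way to use them is to export them as environment variables in your terminal before running CLI commands. A minimal sketch, with placeholder values:

```bash
# Paste the values from your short term access key request
export AWS_ACCESS_KEY_ID="ASIA...EXAMPLE"
export AWS_SECRET_ACCESS_KEY="yourSecretKeyHere"
export AWS_SESSION_TOKEN="yourSessionTokenHere"

# Confirm the credentials work before running anything else
aws sts get-caller-identity
```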

If you are running bioinformatics workflows, you can leverage the serverless functionality of AWS using Amazon HealthOmics, a service for genome-aware storage, serverless workflow execution (using WDL, Nextflow, or CWL), and variant and annotation queries using Amazon Athena. Learn more by completing this AWS tutorial. If you want to use other workflow managers, you can instead try the AWS Genomics CLI, which is a wrapper for genomics workflow managers and AWS Batch (a serverless computing cluster). See our docs on how to set up the Genomics CLI for Cloud Lab. Supported workflow engines include Toil, Cromwell, miniwdl, Nextflow, and Snakemake.
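
To give a feel for the serverless model, the sketch below starts a HealthOmics workflow run from the CLI (this requires a recent AWS CLI version); the workflow ID, role ARN, parameter file, and output bucket are all placeholders you would replace with your own.

```bash
# Kick off a previously created HealthOmics workflow run
aws omics start-run \
  --workflow-id 1234567 \
  --role-arn arn:aws:iam::123456789012:role/MyOmicsRole \
  --name my-first-run \
  --parameters file://params.json \
  --output-uri s3://my-bucket/omics-output/

# Check on progress; results land in the S3 output location when done
aws omics list-runs
```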

Amazon Marketplace

The AWS Marketplace is a platform similar to Amazon.com where you can search for and launch pre-configured solutions such as Machine Images. Examples of images you may launch include those with enhanced security (see the EC2 section) or those optimized for various tasks like machine learning, platform-specific genomics, or accelerated genomics.

Amazon CodeWhisperer

Amazon CodeWhisperer is an AI coding companion that helps accelerate development by providing code suggestions in real time, and it integrates with your integrated development environment (IDE). The tool is free for individual use; click the "Use CodeWhisperer for free" tab in the link provided for setup instructions. CodeWhisperer can be used in Visual Studio Code (VS Code), Amazon SageMaker Studio, JupyterLab, AWS Glue Studio, AWS Lambda, and AWS Cloud9.

Ingest and Store Data using Amazon S3

Data can be stored in two main places in the cloud: in a cloud storage bucket, such as Amazon Simple Storage Service (S3), or on an instance, typically using Elastic Block Store (EBS). Block storage attached to a virtual machine has a finite size (e.g., 200 GB), while S3 offers scalable object storage with no upper limit on total size. There is, however, a 5 TB limit on any individual object you upload to S3, so larger files must be split into parts first.

It’s generally recommended to segregate compute and storage, storing data in S3 for access and copying only necessary data to a specific instance for analysis, then returning the results to S3. Data on an instance is only accessible when the instance is running, unlike S3, which provides constant access, serving as a longer-term storage solution. Check out this tutorial for comprehensive guidance on using S3 effectively.

For frequently used files like reference genomes or protein databases, consider attaching a shared file system to your instance, allowing for smaller instance sizes and reduced EBS storage costs. For many use cases you will need more performant storage; the best options are Amazon Elastic File System and Amazon FSx for Lustre. You can also explore an open-source solution called JuiceFS, or the Nextflow implementation from MMCloud called JuiceFlow.

Here are some handy tips for moving and storing data: use the AWS CLI to upload files to an S3 bucket, adding the --recursive flag to move whole folders. Similarly, transfer data between S3 and your local machine or an EC2 instance using the CLI. To move data onto an instance, you can use scp (while the instance is running), then use aws s3 cp to migrate the data to S3 for safekeeping.
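
The commands below illustrate those tips end to end; the bucket name, key file, and instance address are placeholders.

```bash
# Copy a single file, then a whole folder, into an S3 bucket
aws s3 cp sample.fastq.gz s3://my-bucket/data/
aws s3 cp results/ s3://my-bucket/results/ --recursive

# Pull data back down from S3
aws s3 cp s3://my-bucket/data/sample.fastq.gz .

# Copy a file onto a running EC2 instance over SSH
scp -i my-key.pem data.tar.gz ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:~/
```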

For transferring data from the Sequence Read Archive (SRA) to an instance or S3, leverage the SRA Toolkit, following the best practices outlined in our SRA Toolkit tutorial.
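
As a rough sketch of that pattern (the accession and bucket are placeholders): download a run with the SRA Toolkit, convert it to FASTQ, and park the results in S3.

```bash
# Download an SRA run and convert it to FASTQ files
prefetch SRR1234567
fasterq-dump SRR1234567 --outdir fastq/

# Copy the FASTQ files to S3 so they persist after the instance stops
aws s3 cp fastq/ s3://my-bucket/fastq/ --recursive
```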

Spin up a Virtual Machine and run a workflow

Virtual machines (VMs) on AWS are called Amazon Elastic Compute Cloud (EC2) instances; they are like virtual computers that you access via SSH and that start as (nearly) blank slates. You have complete control over the VM configuration, beginning with the operating system: you can choose from a variety of Linux flavors, as well as macOS and Windows. Virtual machines are organized into machine families with different functions, such as General Purpose, Compute Optimized, and Accelerated Computing. You can also select machines with graphics processing units (GPUs), which can dramatically speed up some workloads but typically cost more than CPU-only machines. Billing occurs on a per-second basis, and larger and faster machine types cost more per second. This is why it is important to stop or delete machines when not in use to minimize costs, and to consider always using an idle shutdown script.

Many great resources exist on how to spin up, connect to, and work on AWS VMs. Start with this Amazon documentation for the different ways to connect to an EC2 instance. NIH staff can connect from their local terminal via SSH or in the browser via Session Manager; NIH-affiliated researchers can only use Session Manager. We wrote a guide with screenshots that walks through the SSH options, and for Windows VMs, look at this tutorial.
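
For reference, here is what launching and connecting to an instance looks like from the CLI; the AMI ID, key pair, addresses, and instance ID are placeholders, and Session Manager access additionally requires the SSM agent and an appropriate instance role.

```bash
# Launch a small instance from a chosen machine image
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.medium \
  --key-name my-key \
  --count 1

# Connect via SSH once the instance is running...
ssh -i my-key.pem ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# ...or via Session Manager
aws ssm start-session --target i-0123456789abcdef0

# Stop the instance when you are done to avoid charges
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
```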

From a security perspective, we recommend that you use Center for Internet Security (CIS) Hardened VMs. These have security controls that meet the CIS benchmark for enhanced cloud security. To use these VMs, go to the AWS Marketplace > Discover Products. Then search for CIS Hardened and choose the OS that meets your needs. Click Continue to Subscribe in the top right, then Continue to Configuration, and set your configuration parameters. Finally, click Continue to Launch. Here you decide how to launch the Marketplace solution; we recommend Launch from EC2, although you are welcome to experiment with the other options. Now click Launch and walk through the usual EC2 launch parameters. Click Launch again, and you can view the status of your VM on the EC2 Instances page.

If you need to scale your VM up or down (see Cost Optimization below), you can always change the machine type: click the instance ID, then go to Actions > Instance Settings > Change instance type. The VM must be stopped before its instance type can be changed.
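
The same resize can be done from the CLI, which is handy for scripting; the instance ID and target type below are placeholders.

```bash
# Stop the instance and wait until it is fully stopped
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0

# Change the instance type, then start it back up
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --instance-type "{\"Value\": \"m5.2xlarge\"}"
aws ec2 start-instances --instance-ids i-0123456789abcdef0
```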

Disk Images and Elastic File Storage

Part of the power of virtual machines is that they offer a blank slate for you to configure as desired. However, sometimes you want to recycle data or installed programs for your next VM. One solution is a disk (or machine) image, where you copy your existing virtual disk to an Amazon Machine Image (AMI), which can serve as a backup or be used to launch a new instance with the programs and data from the previous one. AWS also takes snapshots of your instances, and you can convert these to machine images.
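
Creating an image from an existing instance is a one-liner; the instance ID, image ID, and names below are placeholders.

```bash
# Create an AMI from an existing instance (its EBS volumes are snapshotted)
aws ec2 create-image \
  --instance-id i-0123456789abcdef0 \
  --name "analysis-env-backup" \
  --description "Backup of configured analysis environment"

# Later, launch a new instance from that image
aws ec2 run-instances --image-id ami-0abcdef1234567890 --instance-type t3.large --key-name my-key
```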

For some use cases, you will have large files that you use over and over, such as reference genomes or protein databases (e.g., for AlphaFold or ESMFold). It doesn't make sense to keep these stored on a VM or an AMI if that means paying for EBS storage; you will quickly learn that keeping EBS volumes around adds up in cost. A better solution is an elastic file system that you can attach to VMs (in EC2 or Batch), allowing you to maintain a much smaller root EBS volume (and save costs). The two best services for this are Amazon Elastic File System and Amazon FSx.
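
As an illustration, mounting an existing EFS file system on an instance over NFS looks roughly like this; the file system ID and region are placeholders, the instance needs an NFS client installed, and its security group must allow NFS traffic.

```bash
# Create a mount point and attach the EFS file system over NFS
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1 \
  fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/ /mnt/efs

# Reference data placed here is shared across any instance that mounts it
ls /mnt/efs
```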

Launch a SageMaker Notebook

SageMaker is AWS's ML/AI platform, offering a hosted Jupyter notebook service. Notebooks are ideal for tutorials, combining code with instructions, and for step-by-step exploration of data or workflows; the ability to run code in chunks suits ML/AI problem-solving. Additionally, JupyterLab allows switching between terminal and notebook interfaces. Learn to spin up an instance and explore a genome-wide association study using this notebook example.
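
You can also create a notebook instance from the CLI; the name, instance type, and role ARN below are placeholders (the role is the SageMaker execution role discussed in the IAM section).

```bash
# Create a managed Jupyter notebook instance
aws sagemaker create-notebook-instance \
  --notebook-instance-name my-gwas-notebook \
  --instance-type ml.t3.medium \
  --role-arn arn:aws:iam::123456789012:role/AmazonSageMaker-ExecutionRole-XYZ

# Stop it when you are done working to avoid idle charges
aws sagemaker stop-notebook-instance --notebook-instance-name my-gwas-notebook
```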

Amazon recently launched a new IDE environment called SageMaker Studio, which we recommend for Cloud Lab users. For a comprehensive workshop on SageMaker Studio, go to this on-demand workshop. To launch Studio, first set up a Domain. Once launched, you can use the normal SageMaker notebook features, except that you can resize your VM on the fly. You can also execute a whole ML/AI pipeline, including training, deploying, and monitoring, and you have ready access to JumpStart models for easy-to-deploy large language models. If you hit a quota limit, follow these instructions. You can also launch Foundation Models directly from a notebook via the main SageMaker menu on the left: JumpStart > Foundation Models > View Model > Open Notebook in Studio.

Managing Containers with Elastic Container Registry and Code with CodeCommit

You can host containers within Amazon Elastic Container Registry. Learn how to build a container, push to Elastic Container Registry, and pull to a compute environment in our docs.
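
The typical build-and-push flow looks like this; the account ID, region, and repository name are placeholders.

```bash
# Create a repository and authenticate Docker to your registry
aws ecr create-repository --repository-name my-tool
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Build, tag, and push the container image
docker build -t my-tool .
docker tag my-tool:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-tool:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-tool:latest
```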

Further, you can manage your git repositories within your AWS account using AWS CodeCommit. Learn how to create a repository, authenticate to it, and push and pull files using standard git commands here.
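
In outline, with a placeholder repository name and region (HTTPS access also requires git credentials or a credential helper, as the linked guide describes):

```bash
# Create a repository, then clone it over HTTPS
aws codecommit create-repository --repository-name my-analysis
git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/my-analysis

# Work with it like any other git repository
cd my-analysis
git add . && git commit -m "Add analysis scripts" && git push
```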

Clusters

One great thing about the cloud is its ability to scale with demand. When you submit a job to a traditional cluster, you specify up front how many CPUs and how much memory to give your job, and you may over- or under-utilize these resources. With managed resources like serverless services and clusters, you can leverage a feature called autoscaling, where the compute resources scale up or down with demand. This is more efficient, keeping costs down when demand is low while preventing latency when demand is high. For most Cloud Lab users, the best way to leverage scaling is AWS Batch, but in some cases, perhaps for a whole lab group or a large project, it may make sense to spin up a Kubernetes cluster. Note that if you spin up resources in Batch, you will need to deactivate the compute environment (in Batch) and delete the autoscaling groups (in EC2) to avoid further charges. If using Batch, consider exploring MMCloud to use Spot instances and save up to 50% on your compute; you will need to request a trial license.
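
For orientation, submitting a Batch job and then cleaning up afterwards looks like this; the queue, job definition, and compute environment names are placeholders.

```bash
# Submit a job to an existing Batch queue and job definition
aws batch submit-job \
  --job-name variant-calling-sample1 \
  --job-queue my-queue \
  --job-definition my-job-definition

# When finished, disable the compute environment to stop further charges
aws batch update-compute-environment \
  --compute-environment my-compute-env \
  --state DISABLED
```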

You can also spin up SLURM clusters using AWS ParallelCluster and automate SLURM environment provisioning using CloudFormation. This recipe library on GitHub contains a variety of example configurations.
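
A minimal sketch of the ParallelCluster CLI flow, assuming you have already written a cluster configuration file (the cluster name, config file, and key are placeholders):

```bash
# Install the ParallelCluster CLI, then create a SLURM cluster from a config file
pip install aws-parallelcluster
pcluster create-cluster \
  --cluster-name my-cluster \
  --cluster-configuration cluster-config.yaml

# SSH to the head node and submit SLURM jobs as usual
pcluster ssh --cluster-name my-cluster -i my-key.pem
```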

Billing and Benchmarking

Many Cloud Lab users are interested in understanding how to estimate the price of a large-scale project using a reduced sample size. Generally, you should be able to benchmark with a few representative samples to get an idea of time and cost required for a larger scale project. Follow our Cost Management Guide to see how to tag specific resources for workflow benchmarking. You should also review the AWS Documentation on Billing and Cost Management.

The best way to estimate costs is to use the AWS Pricing Calculator, a tool that forecasts costs based on products and usage. Then you can run some benchmarks and double-check that everything behaves as you expect. For example, if you know that your analysis on your on-premises cluster takes 4 hours to run for a single sample with 12 CPUs, and that each sample needs about 30 GB of storage, then you can extrapolate how much the whole project may cost using the calculator (e.g., EC2 + S3). You can also watch this helpful video from the AnVIL project to learn more about cloud costs.
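
As a back-of-the-envelope illustration of that extrapolation (all prices below are made-up placeholders; always confirm current rates in the Pricing Calculator):

```bash
# Hypothetical benchmark: 4 hours/sample on one instance, 30 GB/sample in S3
samples=100
hours_per_sample=4
instance_price_per_hour=0.68   # placeholder on-demand rate; check the calculator
s3_price_per_gb_month=0.023    # placeholder S3 standard rate; check the calculator
gb_per_sample=30

echo "Compute: $(echo "$samples * $hours_per_sample * $instance_price_per_hour" | bc) USD"
echo "Storage: $(echo "$samples * $gb_per_sample * $s3_price_per_gb_month" | bc) USD per month"
```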

Cost Optimization

Follow our Cost Management Guide for details on how to monitor costs, set up budget alerts, and cost-benchmark specific analyses using resource tagging. In addition, here are a few tips to help you stay on budget.

  • Configure auto-shutdown on your EC2 instances. This will prevent you from accidentally leaving instances running.
  • Make sure you shut down other resources after you use them, and periodically 'clean up' your account. This can include S3 buckets, virtual machines/notebooks, Batch environments, and CloudFormation scripts. For Batch environments, you will also need to go to EC2 and delete the autoscaling groups (far bottom left option on the EC2 page).
  • Use elastic file systems instead of paying for unnecessary EBS storage. Take a look at Amazon Elastic File System and Amazon FSx.
  • Ensure that you are using all the compute resources you have provisioned. If you spin up a VM with 16 CPUs, you can check whether they are all being utilized using CloudWatch (see the sketch after this list). If you are really only using 8 CPUs, for example, change your machine size to fit the analysis. You can also view our CPU optimization guide here.
  • Explore using Spot instances managed by MMCloud to 'Spot surf' and keep long-running jobs going on Spot.
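
Here is a rough sketch of pulling CPU utilization for an instance from CloudWatch, as mentioned in the list above; the instance ID and time window are placeholders.

```bash
# Average hourly CPU utilization for one instance over one day
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-02T00:00:00Z \
  --period 3600 \
  --statistics Average
```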

Getting Support

As part of your participation in Cloud Lab, you will be added to the Cloud Lab Teams channel, where you can chat with other Cloud Lab users and get support from the Cloud Lab team. NIH Intramural users can submit a support ticket to Service Now. For issues related to the cloud environment, feel free to request AWS Enterprise Support. For all other questions, reach out to the Cloud Lab email at CloudLab@nih.gov with the subject line "Cloud Lab Support Request", or open a GitHub Issue.

If you have a question about Quota Limits, visit our documentation on how to request a limit increase.

Additional Training

This repo only scratches the surface of what can be done in the cloud. If you are interested in additional cloud training opportunities, please visit the STRIDES Training page. For more information on the STRIDES Initiative at NIH, visit our website or contact the NIH STRIDES team at STRIDES@nih.gov.