NIH Cloud Lab for Google Cloud

There are a lot of resources available to learn about GCP, which can be overwhelming. NIH Cloud Lab’s goal is to make cloud easy so that you can spend less time on administrative tasks and focus on your research.

Use this repository to learn about how to use GCP by exploring the linked resources and walking through the tutorials. If you are a beginner, we suggest you begin with this Jumpstart section. If you already have foundational knowledge of GCP and cloud, feel free to skip over to our Tutorials repository for in-depth examples of how to run specific workflows such as genomic variant calling and medical image analysis.

Getting Started

You can learn a lot about what is possible on GCP from the GCP Getting Started Page. There you can find links to documentation for common GCP tools and resources, as well as short videos on various subjects called Cloud Minutes. You can also view the Google Cloud Essentials Playlist or Cloud Bytes Playlist from Google to help you get started.

Even with a wealth of resources it can be difficult to know where to start on learning how to use the cloud. To help you, we thought through some of the most common tasks you will encounter doing cloud-enabled research and gathered tutorials and guides specific to those topics. We hope the following materials are helpful as you explore migrating your research to the cloud. Please feel free to submit issues or send us an email at CloudLab@nih.gov with questions.

Before going any further, make sure you can open your GCP project. For Intramural NIH staff, follow this guide. NIH-affiliated researchers can follow this guide.

Overview

There are three primary ways to run analyses on GCP: virtual machines, Jupyter notebook instances, and managed services. We give a brief overview of each here and go into more detail in the sections below.

Virtual machines (VMs) are like desktop computers, except that you access them through the cloud console and you decide what resources they have, such as CPUs and memory. In GCP, the service that hosts these virtual machines is called Compute Engine. You access VMs via SSH (a secure remote connection), either through the console or from the command line.

Jupyter notebook instances are virtual machines with Jupyter Lab preloaded. On GCP these run through Vertex AI or the newer Colab Enterprise. You decide what kind of virtual machine to ‘spin up’, then run Jupyter notebooks on it, accessing them through the console much as you would interact with Jupyter locally.

Finally, managed services let you run an analysis, an app, or a website without having to manage your own servers (VMs). All you have to do is call a command that runs your analysis in the background and copies the output files to a storage bucket. The most common managed, serverless service you will work with here is Google Batch. These workflows are typically run from the command line, either from a VM, Cloud Shell, or your local terminal.

Command Line Tools

One other task that will enable all that comes below is installing and configuring the GCP SDK command line tools, which will allow you to interact with instances or Google Storage buckets from your local terminal. Command line interface (CLI) tools are those that you use directly in a terminal/shell as opposed to clicking within a graphical user interface (UI). Instructions for installing the CLI can be found here. Along the same lines, it is important to familiarize yourself with the two main CLI commands: gcloud and gsutil. There are also other commands you may come across in some circumstances like kubectl. If you have trouble installing the CLI on your local computer, you can still use the same commands from a virtual machine or from Cloud Shell, which is a terminal environment available to users on the GCP console.
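As a quick orientation, here is a minimal sketch of a few commands you might try after installing and authenticating the CLI; the project, buckets, and instances it lists will depend on your own account.

```bash
# Initialize the CLI and authenticate (opens a browser window the first time)
gcloud init

# Confirm which account and project are currently active
gcloud config list

# List the Cloud Storage buckets in your project
gsutil ls

# List the Compute Engine VMs in your project
gcloud compute instances list
```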

Google Cloud Marketplace

The Google Cloud Marketplace is a platform where you can search for and launch pre-configured solutions such as machine images. Examples of images you might launch include those with enhanced security or ones optimized for various tasks like machine learning or accelerated genomics. One of our tutorials showcases using a Marketplace solution called miniKF, which you can test out here.

Ingest and Store Data using Google Cloud Storage

Data can be stored in a few places on the cloud: in a cloud storage bucket, such as Google Cloud Storage (GCS), or on an instance, which offers several storage options. We recommend keeping compute and storage separate: store data in GCS, copy only the data you need to an instance for analysis, then return the results to GCS. Unlike data on an instance, data in GCS remains available even when no instance is running. Check out this tutorial for guidance on using GCS effectively.

We give you a few tips here on moving and storing data; a worked sketch follows the list:
- To upload data to a GCS bucket, use the console UI or `gsutil cp <FILE> gs://<BUCKET>`.
- Create a bucket with `gsutil mb gs://<BUCKET>`.
- Use `gsutil cp -r <DIRECTORY> gs://<BUCKET>` to copy a folder.
- Transfer data between GCS and your local machine or a VM with `gsutil cp gs://<BUCKET>/<FILE> <DESTINATION/PATH>`, and parallelize transfers with the `-m` flag.
- To move data from the Sequence Read Archive (SRA) to GCS or to an instance, consider using `fasterq-dump` from the SRA toolkit, and refer to our [notebook](https://github.com/STRIDES/NIHCloudLabGCP/tree/main/tutorials/notebooks/SRADownload/) for an example.
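The sketch below pulls these tips together; the bucket name, file names, and folder paths are placeholders, not real resources.

```bash
# Create a bucket (bucket names are globally unique; this one is hypothetical)
gsutil mb gs://my-cloudlab-bucket

# Upload a single file to the bucket
gsutil cp results.csv gs://my-cloudlab-bucket/

# Copy a whole folder recursively
gsutil cp -r ./fastq_files gs://my-cloudlab-bucket/fastq_files

# Download files back to a local machine or VM, with -m to parallelize
gsutil -m cp gs://my-cloudlab-bucket/fastq_files/* ./local_fastq/
```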

Spin up a Virtual Machine and run a workflow

Google and other sources have a lot of great resources on how to spin up and use a VM. The first place we will point you is to the NIH Common Data Fund resource, which lays out how to spin up a VM, SSH into it, make a bucket, and move data around similar to what we did in the example notebooks above. One thing worth noting is that the NIH tutorial has you SSH into your instance using a gcloud command in the shell. You can find the GCP specific documentation on how to spin up an instance here. If you want to start a Windows VM, read the documentation. We encourage you to follow our auto-shutdown instructions to prevent leaving machines running.
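If you prefer the command line, a minimal sketch of creating, connecting to, and stopping a VM might look like the following; the instance name, zone, machine type, and image are example values you would replace with your own.

```bash
# Create a small Debian VM (all values here are examples)
gcloud compute instances create my-test-vm \
    --zone=us-central1-a \
    --machine-type=e2-standard-4 \
    --image-family=debian-12 \
    --image-project=debian-cloud

# SSH into the VM from your terminal or Cloud Shell
gcloud compute ssh my-test-vm --zone=us-central1-a

# Stop the VM when you are done to avoid unnecessary charges
gcloud compute instances stop my-test-vm --zone=us-central1-a
```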

Disk Images

Part of the power of virtual machines is that they offer a blank slate for you to configure as desired. However, sometimes you want to recycle data or installed programs for your next VM instead of having to reinvent the wheel. One solution to this issue is using disk (or machine) images, where you copy your existing virtual disk to a Machine Image which can serve as a backup or can be used to launch a new instance with the programs and data from a previous instance.
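As a rough sketch (the instance and image names are hypothetical), creating a machine image from an existing VM and launching a new VM from it can be done from the command line as well as the console:

```bash
# Create a machine image from an existing VM
gcloud compute machine-images create my-analysis-image \
    --source-instance=my-test-vm \
    --source-instance-zone=us-central1-a

# Launch a new VM from that machine image, with the same programs and data
gcloud compute instances create my-new-vm \
    --zone=us-central1-a \
    --source-machine-image=my-analysis-image
```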

Launch a Jupyter Notebook

Jupyter notebooks are web-based interactive coding environments. On GCP, notebooks are launched through the Vertex AI platform, Google’s current approach to machine learning and artificial intelligence workflows. You can read more in the Vertex AI Overview and in the technical documentation and tutorials. To spin up a notebook instance and import an example training notebook, follow our guide here. If you want to practice using the terminal or review bash commands in Jupyter, look at this module from Dartmouth developed for the NIGMS Sandbox.
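While the guide above uses the console, you can also create a user-managed Vertex AI Workbench notebook instance from the command line. This is only a sketch; the instance name, zone, machine type, and image family are example values.

```bash
# Create a user-managed notebook instance (values are examples)
gcloud notebooks instances create my-notebook \
    --location=us-central1-a \
    --machine-type=n1-standard-4 \
    --vm-image-project=deeplearning-platform-release \
    --vm-image-family=common-cpu-notebooks

# List your notebook instances
gcloud notebooks instances list --location=us-central1-a
```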

You can also now use Colab Enterprise from within Vertex AI, which allows you to run Colab notebooks within Google Cloud.

Managing Containers with Google Artifact Registry

You can host containers in either the older Google Container Registry or the newer Google Artifact Registry, which can host containers as well as other artifacts. We outline how to build a container, push it to an Artifact Registry, and pull it to a compute environment in our docs.
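As a hedged sketch of that workflow (the project ID, repository, image name, and region are placeholders), building and pushing an image to Artifact Registry typically looks something like this:

```bash
# Create a Docker repository in Artifact Registry (name and region are examples)
gcloud artifacts repositories create my-repo \
    --repository-format=docker \
    --location=us-central1

# Allow Docker to authenticate to that Artifact Registry region
gcloud auth configure-docker us-central1-docker.pkg.dev

# Build, tag, and push a container image
docker build -t us-central1-docker.pkg.dev/MY_PROJECT/my-repo/my-image:v1 .
docker push us-central1-docker.pkg.dev/MY_PROJECT/my-repo/my-image:v1
```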

Serverless Functionality

Serverless services allow you to run an analysis, an app, a website, etc. without having to manage servers (VMs). The most relevant serverless feature on GCP for Cloud Lab users (especially for ‘omics’ analyses) is Google Batch. You can walk through a tutorial of this service using this notebook. Those doing health informatics should look into the Google Cloud Healthcare Data Engine. You can find a variety of other tutorials from the NIGMS Sandbox, as well as this Google tutorial.
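For orientation, submitting a Google Batch job from the command line generally follows a pattern like the sketch below; the job name, region, and `job.json` config file are assumptions you would replace with your own (the notebook above walks through a full example).

```bash
# Submit a Batch job defined in a JSON config file (values are examples)
gcloud batch jobs submit my-batch-job \
    --location=us-central1 \
    --config=job.json

# Check on the status of the job
gcloud batch jobs describe my-batch-job --location=us-central1
```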

Clusters

One great thing about the cloud is its ability to scale with demand. When you submit a job to a traditional computing cluster (a set of computers that work together to execute a function), you have to specify up front how many CPUs and how much memory to allocate to your job, and you may over- or under-utilize these resources. On the cloud, by contrast, you can leverage a feature called autoscaling, where compute resources scale up or down with demand. This is more efficient: it keeps costs down when demand is low and prevents latency when demand is high (e.g., a whole hackathon submitting jobs to a cluster). For most Cloud Lab users, the best way to benefit from autoscaling is to use a managed service like the Life Sciences API or Google Batch, but in some cases, perhaps for a whole lab group or a large project, it may make sense to spin up a Kubernetes cluster and submit jobs to it using a workflow manager like Snakemake. One of our tutorials uses a Marketplace solution to deploy a small Kubernetes cluster and then run AI models using Kubeflow. Finally, you can spin up SLURM clusters on GCP, which is easiest to do through Cloud Shell or a VM but can also be accessed via SSH from your terminal. Instructions are here.
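If you do decide to experiment with your own Kubernetes cluster, a minimal sketch of creating a small autoscaling GKE cluster looks something like this (the cluster name, zone, and node counts are examples, and costs accrue while the cluster is running):

```bash
# Create a small GKE cluster with node autoscaling enabled (values are examples)
gcloud container clusters create my-cluster \
    --zone=us-central1-a \
    --num-nodes=1 \
    --enable-autoscaling \
    --min-nodes=1 \
    --max-nodes=4

# Fetch credentials so kubectl can talk to the cluster, then inspect the nodes
gcloud container clusters get-credentials my-cluster --zone=us-central1-a
kubectl get nodes
```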

Billing and Benchmarking

Many Cloud Lab users are interested in estimating the cost of a large-scale project from a reduced sample size. Generally, you should be able to benchmark with a few representative samples to get an idea of the time and cost required for a larger-scale project. A great way to estimate costs is the GCP cost calculator, a tool that estimates prices based on location and VM type and size. Then you can run some benchmarks and double-check that everything behaves as you expect. For example, if you know that your analysis on your on-premises cluster takes 4 hours to run for a single sample with 12 CPUs, and that each sample needs about 30 GB of storage, then you can extrapolate total costs using the calculator (e.g., Compute Engine + Cloud Storage).

To get a more precise estimate, you can assign labels to your workflows and then generate a report for a specific label. You can learn how to do that in our docs. Note that it can take up to 24 hours for the billing account to update, so you may need to wait a few hours after running an analysis before the report is accurate. You can also watch this helpful video from the AnVIL project to learn more about cloud costs.
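As a small illustration (the instance, bucket, and label values are hypothetical), labels can be attached to VMs and buckets from the command line and then used to filter billing reports:

```bash
# Label an existing VM so its costs can be filtered in billing reports
gcloud compute instances update my-test-vm \
    --zone=us-central1-a \
    --update-labels=project=rnaseq-benchmark,owner=cloudlab-user

# Labels can also be attached to Cloud Storage buckets
gsutil label ch -l project:rnaseq-benchmark gs://my-cloudlab-bucket
```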

Cost Optimization

As you go through all the tutorials, you can keep costs down by stopping and/or deleting resources (e.g., VMs or Buckets) you no longer need. Another strategy is to ensure that you are using all the compute resources you have provisioned. If you spin up a VM with 16 CPUs, you can see if they are all being utilized using Cloud Monitoring. If you are only really using 8 CPUs for example, then change your machine size to fit the analysis. You can also play with Spot instances to save money. Finally, you can create Budget Alerts to help you track your budget.
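For example, a minimal sketch of stopping and resizing an underutilized VM (the instance name, zone, and machine type are placeholders) might look like this:

```bash
# Stop a VM you are not actively using (it can be restarted later)
gcloud compute instances stop my-test-vm --zone=us-central1-a

# Resize the stopped VM to a smaller machine type, e.g. from 16 vCPUs
# down to 8 if Cloud Monitoring shows it is underutilized
gcloud compute instances set-machine-type my-test-vm \
    --zone=us-central1-a \
    --machine-type=e2-standard-8

# Restart the VM with the new size
gcloud compute instances start my-test-vm --zone=us-central1-a
```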

Managing Your Code with the Google Source Repository

If you want a Google-native Git solution, you can try the Google Cloud Source Repository. It uses all the normal git commands and will feel very familiar if you are used to GitHub. If you want more background, or want to try it out, view our guide.
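For instance, a minimal sketch of creating and cloning a repository (the repository name is an example) looks like this, after which normal git commands apply:

```bash
# Create a Cloud Source Repository and clone it locally (name is an example)
gcloud source repos create my-analysis-code
gcloud source repos clone my-analysis-code

# From there, standard git commands work as usual
cd my-analysis-code
git add .
git commit -m "Initial commit"
git push origin main  # or 'master', depending on your local git default branch
```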

Getting Support

As part of the NIH Cloud Lab sign-up process, you will be added to the Cloud Lab Teams channel. Feel free to message others in the group for support and our team will also chime in and help. For other questions, you can reach out to the Cloud Lab email at CloudLab@nih.gov with the subject line “Cloud lab Support Request”, or open a GitHub Issue. For issues that the NIH Cloud Lab Support Team is unable to resolve, you can reach out to GCP enterprise support directly by clicking the question mark in the top right part of the console and opening a support case.

Additional Training

This repo only scratches the surface of what can be done in the cloud. If you are interested in additional cloud training opportunities, please visit the STRIDES Training page. For more information on the STRIDES Initiative at NIH, visit our website or contact the NIH STRIDES team at STRIDES@nih.gov.