Why the Cloud?
Advantages for NIH and Biological Data Sets
There are several key benefits that the Cloud provides in support of biomedical research:
- Enables collaboration (within labs, across labs, within consortia and even between scientific disciplines) because it enables equal access to the data and compute resources, regardless of the capabilities of the researcher’s institutional environment.
- Can be used to reduce wasteful (and expensive and error prone) duplication of data.
- Facilitates reproducible science by allowing the data and associated analysis pipelines to be shared with others.
- Can be configured as a secure environment where all data access is tracked, authenticated, and authorized.
The Cloud is also well-suited to address practical realities of modern biomedical informatics:
- Data sets are outgrowing local infrastructure, inhibiting researchers’ ability to maintain data.
- Compute requirements to process these large datasets are exceeding local capacity, inhibiting researchers’ ability to analyze data.
- Downloading large-scale data to a local computer for analysis may be more difficult than bringing computational capability to where the data is located.
- Data is currently scattered, hard to find, and not ‘FAIR’ [Findable, Accessible, Interoperable, Reusable], and differences in local IT environments make collaborating on research projects across organizations challenging. The Cloud has the capacity to provide a single environment where these datasets can be stored for wider use.
Advantages of the Cloud
- You benefit from the economies of scale available to the CSPs and from the competition with other cloud vendors.
- Spot pricing: Reduced rates are available to encourage the use of the CSP’s idle compute resources.
- Reserved compute instance pricing: Compute costs can be lower if you reserve your compute resources for longer periods of time.
- Pay for what you use: Most CSP pricing is on a ‘per hour’ basis.
- Constantly evolving toolset and features
  - Cloud providers are constantly evolving their tools and features, particularly around compute capacity and storage and, more recently, in the areas of AI, Machine Learning, and networking.
- No need to maintain local infrastructure
  - Little or no capital expenditure to establish a compute infrastructure (though potentially a large operating expense instead).
- Expandable capacity, no long-term commitments
  - Easy to expand your resources on a temporary basis, e.g., to do a specific analysis with many more compute nodes, or increased data storage for a new dataset. The resources can be brought online very quickly and then turned off once they are no longer needed.
- Amenable to DevOps approaches to allow ‘scriptable infrastructure’ or ‘Infrastructure as Code’
  - Tools like Ansible, Chef, Puppet, etc. can be used to describe the cloud infrastructure and bring it into being with little to no manual intervention.
  - Allows hardware to be created, torn down, and recreated quickly as needed, which is handy when you are paying by the hour.
  - Also creates documentation for the system configuration as a convenient side effect.
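The ‘Infrastructure as Code’ idea above can be illustrated with a minimal sketch. It does not use Ansible, Chef, or Puppet; the `Node` class, the instance type names, and the `plan` function are hypothetical and only show the underlying pattern: you declare the infrastructure you want, and a tool works out what to create or tear down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One compute resource in a declarative description (hypothetical schema)."""
    name: str
    instance_type: str


def plan(desired: set[Node], current: set[Node]) -> dict[str, list[str]]:
    """Compare the declared infrastructure with what is currently running."""
    return {
        "create": sorted(n.name for n in desired - current),
        "destroy": sorted(n.name for n in current - desired),
    }


# Declared state: a head node plus four temporary workers for one analysis.
head = Node("head", "small")
workers = {Node(f"worker-{i}", "large") for i in range(4)}

# Scaling up: only the head node exists so far.
scale_up = plan(desired={head} | workers, current={head})

# Scaling down: declare only the head node again once the analysis is done.
scale_down = plan(desired={head}, current={head} | workers)

print(scale_up)
print(scale_down)
```

Because the declared state lives in a file under version control, it also serves as the configuration documentation mentioned above.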
AWS has a page describing a variety of existing uses of cloud technology for research and technical computing and outlining AWS solutions in this space. AWS also has an eBook entitled Personalized Medicine and Genomic Research: Profiles in Cloud-Enabled Scientific Discovery, which covers four real-world genomic use cases (registration is required to access the eBook).
Azure has an Academic Research section that outlines the usage of cloud computing across multiple disciplines.
Potential Downsides for NIH and Biological Data Sets
The Cloud has many benefits, but it can be perceived as a panacea for a wide variety of issues the biomedical community is now facing, so there are some caveats to keep in mind.
- Data uploaded into the cloud are not automatically FAIR, and are therefore not useful unless effort goes into making them FAIR (with all that entails).
- Many scientific tools are currently not configured to work effectively in the cloud, and must be reconfigured.
- Resources (data, tools, etc.) in the cloud are not guaranteed to persist, so the cloud should not be a replacement for depositing data in a proper data repository.
- Data download costs (egress fees) from the cloud can get expensive for large data volumes. This is a vital consideration if you are putting data into the cloud to allow unfettered access to interested parties.
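As a rough illustration of how egress fees scale, the sketch below multiplies dataset size, number of downloads, and a per-GB rate. The rate and the usage figures are hypothetical placeholders, not any CSP's actual pricing; check your provider's current price list before planning.

```python
def egress_cost_usd(dataset_gb: float, downloads: int, rate_per_gb: float) -> float:
    """Estimate the egress charge for serving a dataset a given number of times."""
    return dataset_gb * downloads * rate_per_gb


# Hypothetical inputs: a 500 GB dataset downloaded 200 times in a month
# at an assumed rate of $0.09 per GB.
monthly_cost = egress_cost_usd(dataset_gb=500, downloads=200, rate_per_gb=0.09)
print(f"${monthly_cost:,.2f}")  # prints $9,000.00
```

Even a modest per-GB rate adds up quickly once a dataset is downloaded in full by many users.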
When the Cloud May Not be the Best Option
There are certain situations where the cloud may not be an appropriate solution to the computing problem at hand.
Very High-performance HPC
Due to the physical design and architecture of cloud hardware, it may not be appropriate for certain supercomputing applications such as computational chemistry (e.g., quantum mechanics/molecular mechanics simulations) or physics (e.g., computational fluid dynamics) where the node-to-node latency needs to be low. This is because the Cloud can sometimes have networks that are high latency or oversubscribed, resulting in lower performance compared to a real “supercomputer.”
This situation may be improving with recent announcements from AWS; however, purists would still argue the Cloud is slow compared to the latest and greatest big iron.
When It’s Cost-effective to NOT Use the Cloud
There are situations where it’s more cost-effective not to use the Cloud in terms of total cost of ownership (TCO). For example, if you have a local system that does some regular analysis with local data and the resource is utilized 100% of the time, you basically have a steady-state workload and your TCO can often be lower than using the Cloud since you effectively amortize hardware cost over time.
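The steady-state argument can be made concrete with a small calculation. All figures below (hardware price, lifetime, yearly operating cost, and the cloud hourly rate) are hypothetical and chosen only to show the shape of the comparison.

```python
HOURS_PER_YEAR = 8760


def on_prem_cost_per_used_hour(
    hardware_usd: float,
    lifetime_years: float,
    yearly_ops_usd: float,
    utilization: float,
) -> float:
    """Amortize purchase and operating costs over the hours actually used."""
    total_cost = hardware_usd + yearly_ops_usd * lifetime_years
    used_hours = lifetime_years * HOURS_PER_YEAR * utilization
    return total_cost / used_hours


# Hypothetical figures for one server and a comparable cloud instance.
CLOUD_RATE_USD_PER_HOUR = 1.00
server = dict(hardware_usd=20_000, lifetime_years=4, yearly_ops_usd=3_000)

busy = on_prem_cost_per_used_hour(**server, utilization=1.0)
mostly_idle = on_prem_cost_per_used_hour(**server, utilization=0.2)

print(f"100% utilized: ${busy:.2f}/hour vs cloud ${CLOUD_RATE_USD_PER_HOUR:.2f}/hour")
print(f" 20% utilized: ${mostly_idle:.2f}/hour vs cloud ${CLOUD_RATE_USD_PER_HOUR:.2f}/hour")
```

Under these assumptions the fully utilized local server costs less per hour than the cloud instance, while the mostly idle one costs several times more. The comparison turns on utilization, not on the hardware price alone.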
Another example would be when the Cloud cost metering is not to your benefit or is not well understood. Imagine you set up a “data sharing” website that suddenly becomes very popular; you could end up paying significant data egress charges that you might have avoided by negotiating rates ahead of time or hosting the data in a different way.
These examples speak to the importance of understanding the IT challenges you are trying to solve and the associated infrastructure requirements needed for a system to address these challenges. They also highlight the need to understand how CSPs charge for their services and to do some upfront planning and cost estimation, plus ongoing monitoring to ensure that there are no unpleasant surprises when the monthly bill appears.
Just as there are many good reasons for using a Cloud solution, there are also quite a few things to be aware of as you start down this road.
- There is often a need to reevaluate how solutions are architected for the Cloud. This is to allow the system to take advantage of the options available via the cloud — particularly its elastic capabilities — and to manage the different service cost model.
- Avoid overprovisioning compute and storage (use what you need when you need it, then remove it when you are done).
- Using the elastic and dynamic capabilities of cloud services is highly recommended.
- Implementing system automation using DevOps tools supports a more dynamic system provisioning approach.
- Egress charges for data sent to the internet, or to different zones/regions of the CSP, may lead to very significant service bills (users could potentially download large volumes of data unless you configure your cloud suitably).
- Plan on implementing a comprehensive monitoring solution for the utilized services and your application’s performance in order to periodically re-optimize the cloud solution. (Cloud providers typically monitor the services you consume but not necessarily the performance of your applications, which you will need to monitor yourself.)
- Sharing data between clouds and/or on-premises data centers can present challenges.
- Viable options: create an appropriate set of user roles, bucket policies, or signed URLs.
- Integrating your institution’s access control and authentication with those of the CSP can be a major effort. This will involve your IT Infosec and the CSP working together.
- Limited technical portability between CSPs (you may find yourself locked in to a particular CSP if you use a solution which relies on their proprietary technology).
- Increase in security vulnerabilities due to multi-tenancy (infrastructure such as network storage, hosting servers, networks, etc., is shared on public clouds).
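To show how the signed URLs mentioned above work in principle, here is a minimal sketch using only the Python standard library. It is not any CSP's actual signing scheme; the secret key, host name, file path, and query parameter names are all hypothetical. The idea is that the link carries an expiry time and a keyed hash, so the server can check that the link is still valid and was not altered.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret-key"  # hypothetical; never hard-code a real key


def _signature(path: str, expires: int) -> str:
    """Keyed hash over the path and expiry time."""
    message = f"{path}:{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_url(base: str, path: str, now: int, ttl_seconds: int) -> str:
    """Build a link that is valid until now + ttl_seconds."""
    expires = now + ttl_seconds
    query = urlencode({"expires": expires, "signature": _signature(path, expires)})
    return f"{base}{path}?{query}"


def verify_url(url: str, now: int) -> bool:
    """Accept the link only if it is unexpired and the signature matches."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    if now > expires:
        return False
    return hmac.compare_digest(signature, _signature(parsed.path, expires))


# Times are plain integers (seconds) so the example is deterministic.
url = sign_url("https://data.example.org", "/cohort/variants.vcf.gz", now=1_000, ttl_seconds=600)
print(verify_url(url, now=1_200))  # within the validity window
print(verify_url(url, now=2_000))  # after expiry
```

A collaborator who receives such a link needs no account on the hosting system, and access lapses on its own when the link expires.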
Legal, Governance, and Compliance
- Reduced operational governance control over the CSP (you will have little control other than that spelled out in the terms and conditions you signed with the CSP).
- Multiregional compliance and legal issues (for example, cloud providers may not store your data in the Continental United States (CONUS) unless special agreements are put in place).
- Legal and HIPAA requirements for handling certain types of PI or PHI data may make it necessary to have a Business Associate Agreement (BAA) with the CSP. Working through a CSP reseller may require additional BAAs to be put in place. Getting these agreements in place can take longer than you may anticipate!
Staffing and Support
- Needs cloud technical support (you may not have personnel who are familiar with your CSP’s services, and support from the CSP can be expensive!).