Preparing a Cloud Project

Understanding the big picture behind your data (e.g., what the data is, who created it, and what it is used for) is essential because it creates context and enables you to make an informed decision about how to move your data to the cloud.

Additionally, preparing your data before upload by removing anomalies, de-identifying protected information, and profiling the data is often a cost-effective and time-saving practice.

Identifying Key Metrics and Preparing Your Data

Being prepared with some key metrics about your data will help ensure a streamlined process for uploading data to your chosen STRIDES partner’s cloud platform. The STRIDES team can help you answer these questions and determine the best partner to use by matching your needs with the available options.

Begin to gather the following information:
| Subject | Question |
| --- | --- |
| Current size | How large are the datasets currently (MB/GB/TB/PB)? |
| File count and size distribution | How many individual files are there? What is the range of file sizes? |
| Data format | What format is the data currently in? |
| Data upload | Do you know about or want to use data prep tools that can help you prepare your data during the data upload process? Do you want to upload all the data at once? If not, how do you want to split the data into multiple uploads? |
| Data transfer | Do you plan on moving the data off drives to a tape drive? Do you want to ship the data to the STRIDES Initiative partner? |
| State of data | What state do you want the data to be in? How do you plan on cleaning the data (e.g., find anomalies, organize the data, etc.) before you transfer it to the cloud? |
| Storage duration | How long do these datasets need to live on the cloud (in months)? |
| Estimated growth | What is the estimated rate of growth for new data (e.g., percentage increase per year)? |
| Backup and disaster recovery requirements | What data needs to be backed up? How many versions or copies of the backup need to be kept? Where do they need to live? |
| Update frequency | Will the data be updated or replaced frequently? Is the update a total rewrite of the entire dataset or just an incremental update to the existing dataset? What other tasks must happen following an update (e.g., re-indexing of the dataset)? |
| Access frequency | How frequently is the data accessed (e.g., daily, a few times per week, once a month, once every six months, almost never)? |
| Desired access latency | Does the data need to be available almost instantly on request, or is it okay if it appears in a few seconds, one to two minutes, or longer? |
| Data movement patterns | Will you need to download data from the cloud to a local system, or will it remain in the cloud during any subsequent analysis steps? |
| Current data location | Where does the data geographically live (country and/or general region within the country)? What devices does the data physically live on (e.g., disks, tape drives, etc.)? How many devices are there? |
| Predominant researcher location | Where are the main researchers geographically located (country and/or general region within the country)? |
| Data movement | Is there significant data movement to or from different sites that occurs during the regular course of business? |
| Data management | How are you keeping your data secure and making it findable, accessible, interoperable, and reusable (FAIR)? |

  • Data Security
    • What measures do you have in place to protect your data, including sensitive or protected data?

  • Findable
    • How is your data findable by your team and other collaborators?
      • How is the data indexed?

  • Accessible
    • How do users access your data?
      • Is the data open to the public?
      • What kind of sign-in process will you use if approval is required to access the data?

  • Interoperable
    • Will you connect different data types for your analysis?
      • If so, how?
      • What is the provenance of the data and how will it be curated for harmonization?

  • Reusable
    • What kinds of licenses are associated with the data?
      • Is ownership of the data clearly described?
      • If there is human data, are consent agreements propagated for other research uses?
      • Is this information associated with the metadata?
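
Several of the metrics listed above (current size, file count and size distribution, and estimated growth) can be gathered with a short script before you talk to the STRIDES team. The following is a minimal Python sketch, not an official STRIDES tool: the function names (profile_directory, human_readable, projected_bytes) are hypothetical, sizes are reported in decimal units (1 KB = 1000 bytes), and growth is assumed to compound annually at a fixed percentage.

```python
import os
from pathlib import Path
from statistics import median


def profile_directory(root):
    """Walk `root` and summarize file count and size distribution in bytes."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                continue  # skip links so data is not counted twice
            try:
                sizes.append(path.stat().st_size)
            except OSError:
                continue  # unreadable file: leave it out of the summary
    if not sizes:
        return {"file_count": 0, "total_bytes": 0,
                "min_bytes": 0, "median_bytes": 0, "max_bytes": 0}
    return {
        "file_count": len(sizes),
        "total_bytes": sum(sizes),
        "min_bytes": min(sizes),
        "median_bytes": median(sizes),
        "max_bytes": max(sizes),
    }


def human_readable(num_bytes):
    """Format a byte count using decimal units (assumption: 1 KB = 1000 B)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1000:
            return f"{size:.1f} {unit}"
        size /= 1000
    return f"{size:.1f} PB"


def projected_bytes(current_bytes, annual_growth_pct, months):
    """Estimate future size, assuming growth compounds annually (a simplification)."""
    return current_bytes * (1 + annual_growth_pct / 100) ** (months / 12)
```

For example, calling profile_directory on the top-level folder of a dataset returns the file count and the smallest, median, and largest file sizes, and passing the total to projected_bytes with your expected yearly growth and storage duration gives a rough figure to bring to the sizing discussion. Treat the result as an estimate only; actual storage needs depend on format, compression, and backup copies.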

Configuring Data and Uploading Data to the Cloud

Before uploading your data into the cloud, configure your data for the needs of your program/project. For example, you might consider whether users are likely to need individual data files, sets of related files, a pre-computed analysis dataset, and/or the entire dataset in its original “raw” format. The process for uploading data can vary based on the STRIDES Initiative partner you choose and the sensitivity of your data, and many of the partners offer data migration tools. As a best practice, work with your chosen STRIDES Initiative partner as you begin the data migration process to ensure you are selecting the optimal approach, meeting any requirements of the cloud platform, and meeting any requirements of your research program/project and dataset(s).

For more information on how to configure and upload your data, visit your selected STRIDES Initiative partner’s website.

The STRIDES team can help facilitate the connection to your chosen STRIDES Initiative partner.