Community Data – Not all data is equal

HySpeed Computing explores the concepts and ideas behind community data sharing.

A common theme heard throughout the scientific community is the need for more open and more effective data sharing mechanisms. However, not all data is created equal, nor should there be a single methodology or pathway for distributing data.

So what are the differences in data? Data can be categorized by its origin and intended use, and each category carries different considerations for sharing. Below are four main categories of data.

Application Data. This category includes data that is routinely used to meet the needs of one or more applications. For example, satellite imagery from the suite of Landsat sensors provides a multidisciplinary resource for a broad array of earth observing applications, such as forestry, agriculture, coastal and urban monitoring projects. Such data is typically housed in large repositories that offer users access to the data, but usually with little additional information beyond descriptors of the data characteristics and application domains.

Development Data. This is data used to develop new algorithms and analysis techniques. For example, data collected from a new instrument, such as the next-generation AVIRIS sensor, is provided to the science community to test sensor performance and explore new analysis capabilities. These types of data are usually offered in smaller data repositories, sometimes with more restricted access, but typically include additional supporting documentation beyond just the data characteristics, such as sensor design information, science discussions and research results.

Validation Data. This category refers to data that supports an existing research finding and is offered as a resource for others to validate that finding. For example, satellite data documenting the declining ice coverage in the Arctic regions is made available for multiple research groups to independently assess and validate conclusions on global change. As with the development category, such data is typically offered in smaller data repositories, which in addition to the data characteristics contain summaries of the research methods used and the results already obtained with the data.

Private Data. This type of data contains personal or confidential information that could cause harm or damage if released. For example, imagery of military facilities can contain details that are inappropriate to release publicly. In some cases such data can be openly distributed if it is deemed no longer sensitive, or if safeguards are in place to restrict distribution or conceal particular elements, but in most cases such data is appropriately kept confidential.

Note that data categories are not exclusive of one another. A given dataset can easily fall into more than one category. The important point is to recognize the particular characteristics of the data and share it appropriately and as openly as possible.
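For readers who catalog datasets in software, the overlap between categories can be sketched with a flag-style enumeration. This is only an illustrative sketch: the names DataCategory and is_openly_shareable, the example datasets, and the simple rule that anything not flagged private may be shared openly are all assumptions made for this example, not part of any existing portal or library.

```python
from enum import Flag, auto


class DataCategory(Flag):
    """The four categories described above; members can be combined."""
    APPLICATION = auto()
    DEVELOPMENT = auto()
    VALIDATION = auto()
    PRIVATE = auto()


def is_openly_shareable(categories: DataCategory) -> bool:
    """Hypothetical rule: a dataset is openly shareable unless flagged PRIVATE."""
    return DataCategory.PRIVATE not in categories


# Hypothetical dataset that serves both development and validation purposes.
sea_ice = DataCategory.DEVELOPMENT | DataCategory.VALIDATION

# Hypothetical dataset that is routinely used but also contains sensitive details.
restricted_imagery = DataCategory.APPLICATION | DataCategory.PRIVATE

print(is_openly_shareable(sea_ice))
print(is_openly_shareable(restricted_imagery))
```

In practice the sharing decision would depend on more than a single flag, such as licensing and any safeguards applied, so the rule here is deliberately simplistic.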

HySpeed Computing will continue to explore different aspects of community data sharing in future posts, and will soon also be releasing its own data access portal.

Data for the Community – To Share or Not To Share

Data is the foundation on which scientific research is built. Data is the basis for testing scientific hypotheses, developing new analysis techniques, deducing substantive correlations, identifying systematic trends, and generating research products. But what happens to this critical data once the report has been written and the paper has been published? What is the data legacy?

More often than not, data has value beyond its initial use. For instance, other researchers can use the data to corroborate results, integrate findings into larger, more comprehensive data sets, investigate new hypotheses without replicating data collection efforts, form the basis for new research directions, and serve as example data for student research projects. While such data is sometimes made available to the community with these benefits in mind, it is all too often relegated to a dusty storage cabinet or forgotten computer hard drive.

To overcome this shortfall, there is an increasing movement towards data sharing. For example, as of January 2011, the U.S. National Science Foundation (NSF) requires research projects funded by the agency to include a Data Management Plan. The intent is that “Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants.” In another example, the Australian government is similarly encouraging data sharing through its AusGOAL program, which “provides support and guidance to government and related sectors to facilitate open access to publicly funded information.”

Open data sharing, however, is not a straightforward objective to achieve. There are many questions that need to be considered. For example: Who is responsible for maintaining the data archive – the researcher, the funding agency, the government? How long should data be retained and made available – five years, ten years, indefinitely? What are the costs associated with data storage – hardware, software, support personnel? Who should have access to the data – academics, government, industry? What are the licensing agreements associated with the data – public domain, research only, commercial? Additionally, some data can’t be made openly available due to concerns with privacy (e.g., medical research), confidentiality (e.g., intellectual property), or security (e.g., national defense).

Despite these challenges, many people are actively contributing data to the community, thereby extending the utility of their research and expanding their own influence. This trend is expected to continue. Data is an integral part of our intellectual growth and community knowledge. Are you sharing your data?