The Benefits of Sharing Data – Why it’s worth the effort

HySpeed Computing explores the concepts and ideas behind community data sharing.

Sharing data with the greater community is not without effort. The data must be organized and thoroughly documented, a repository for hosting the data must be identified and maintained, and a chain of custody needs to be established to answer questions that may arise and to ensure data longevity. So you may ask: why make the effort? What are the benefits? Here we list some of the top advantages of community data sharing.

Expanded impact. Sharing data will increase the number of citations for publications related to the data. In research, particularly academic research, publications are the primary currency for disseminating knowledge, establishing expertise in a given field of study, advancing career status, and obtaining grant funding. In some scientific disciplines it is common practice to publish the data in conjunction with the research methods and results; however, in most cases this is not the norm. Sharing data, either as an addendum to a publication or in a separate repository, provides more resources than the publication alone, leading to both a greater impact on the community and an improved return on citations.

Education. Sharing data provides a valuable resource for educating others. When learning something new there is nothing like a hands-on experience to help assimilate the knowledge, which is why most classes, seminars and training sessions involve a project or exercise. However, most of us can recall a situation where the hardest aspect of the assignment seemed to be actually finding the data. This is a valuable lesson in itself, since data doesn’t always exist just because we think it should, but it is also an indication that not enough quality data is readily available for educational purposes. Sharing data resources is thus an important component of improved opportunities for education.

Innovation. Sharing data is the foundation for continued innovation. Once collected or created for a given project, or series of projects, data has served an important role and fulfilled its initially conceived use. Beyond this there is almost certainly potential for new ideas and analysis methods to be developed based on this same data. These ideas can then spark new collaborations and projects, which can lead to yet more advances. Sharing data thereby represents a building block for enabling further innovation.

Archiving. Sharing data establishes a legacy for its continued utility. In some situations, such as many federally funded grants in the U.S., researchers are required to develop and execute a data management plan, which includes long-term plans for data storage and availability. Sharing data can thus be an important component of meeting grant requirements. Sharing also plays a role in establishing data longevity, providing a valuable resource for future research.


Community Data – Not all data is equal

A common theme heard throughout the scientific community is the need for more open and more effective data sharing mechanisms. However, not all data is created equal, nor should there be a single methodology or pathway for distributing data.

So what are the differences in data? Data can be categorized as a function of its origin and intended use, and each data type has different considerations associated with sharing. Below are examples of four main categories of data types.

Application Data. This category includes data that is routinely used to meet the implementation needs of one or more applications. For example, satellite imagery from the suite of Landsat sensors provides a multidisciplinary resource for a broad array of earth observing applications, e.g., forestry, agriculture, coastal, and urban monitoring projects. Such data is typically housed in large repositories that offer users access to the data, but with little additional information beyond descriptors of the data characteristics and application domains.

Development Data. This is data utilized to develop new algorithms and analysis techniques. For example, data collected from a new instrument, such as the next generation AVIRIS sensor, is provided to the science community to test sensor performance and explore new analysis capabilities. These types of data are usually offered in smaller data repositories, sometimes with more restricted access, but typically include additional supporting documentation beyond just the data characteristics, such as sensor design information, science discussions and research results.

Validation Data. This category refers to data used for an existing research discovery and offered as a resource for others to validate the same findings. For example, satellite data documenting the declining ice coverage in the Arctic regions is made available for multiple research groups to independently assess and validate conclusions on global change. As with the development category, such data is typically offered in smaller data repositories, which in addition to the data characteristics contain summaries of existing research methods and results that have already been obtained using the data.

Private Data. This type of data contains personal or confidential information that, if released, could cause harm or damage. For example, imagery of military facilities can contain details that are inappropriate to release publicly. In some cases such data can be openly distributed if it is deemed no longer sensitive, or if safeguards are in place to restrict distribution or conceal particular elements, but in most cases such data is appropriately kept confidential.

Note that data categories are not exclusive of one another. A given dataset can easily fall into more than one category. The important point is to recognize the particular characteristics of the data and share it appropriately and as openly as possible.

HySpeed Computing will continue to explore different aspects of community data sharing in future posts, and will soon also be releasing its own data access portal.

Data for the Community – To Share or Not To Share

Data is the foundation on which scientific research is built. Data is the basis for testing scientific hypotheses, developing new analysis techniques, deducing substantive correlations, identifying systematic trends, and generating research products. But what happens to this critical data once the report has been written and the paper has been published? What is the data legacy?

More often than not, data has value beyond its initial use. For instance, data can be used by other researchers to corroborate results, integrate findings into larger, more comprehensive data sets, investigate new hypotheses without replicating data collection efforts, form the basis for new research directions, and serve as example data for student research projects. While such data is sometimes made available to the community with these benefits in mind, it is all too often relegated to a dusty storage cabinet or forgotten computer hard drive.

To overcome this shortfall, there is an increasing movement towards data sharing. For example, as of January 2011, the U.S. National Science Foundation (NSF) requires research projects funded by the agency to include a Data Management Plan. The intent is that “Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants.” In another example, the Australian government is similarly encouraging data sharing through its AusGOAL program, which “provides support and guidance to government and related sectors to facilitate open access to publicly funded information.”

Open data sharing, however, is not a straightforward objective to achieve. There are many questions that need to be considered. For example: Who is responsible for maintaining the data archive – the researcher, the funding agency, the government? How long should data be retained and made available – five years, ten years, indefinitely? What are the costs associated with data storage – hardware, software, support personnel? Who should have access to the data – academics, government, industry? What are the licensing agreements associated with the data – public domain, research only, commercial? Additionally, some data can’t be made openly available due to concerns with privacy (e.g., medical research), confidentiality (e.g., intellectual property), or security (e.g., national defense).

Despite these challenges, many people are actively contributing data to the community, thereby extending the utility of their research and expanding their own influence. This trend is expected to continue. Data is an integral part of our intellectual growth and community knowledge. Are you sharing your data?