Data Management and You – A look at NSF requirements for data organization and sharing

This is Part 1 of a discussion series on data management requirements for government funded research.

Data is powerful. From data comes information, and from information comes knowledge. Data is also a critical component of quantitative analysis and of proving or disproving scientific hypotheses. But what happens to data after it has served its initial purpose? And what are your obligations, and potential benefits, with respect to openly sharing data with other researchers?

Data management and data sharing are viewed with growing importance in today’s research environment, particularly in the eyes of government funding agencies. Not only is data management a requirement for most proposals using public funding, but effective data sharing can also work in your favor in the proposal review process. Consider the difference between two accomplished scientists, both conducting excellent research and publishing results in top journals, where only one has made their data openly available, with thousands of other researchers already accessing that data for further research. Clearly, the scientist who has shared data has created substantial additional impact on the community and facilitated a greater return on investment beyond the initially funded research. Such accomplishments can and should be included in your proposals.

As one example, let’s examine the data management requirements for proposals submitted to the U.S. National Science Foundation. What is immediately obvious when preparing an NSF proposal is the need to incorporate a two-page Data Management Plan as an addendum to your project description. Requirements for the Data Management Plan are outlined in the “Proposal and Award Policies and Procedures Guide” (2013) within both the “Grant Proposal Guide” and the “Award & Administration Guide.” Note that in some cases there are also specific data management requirements for particular NSF Directorates and Divisions, which need to be adhered to when submitting proposals for those programs.

To quote from the NSF requirements: “Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing.” Accordingly, the proposal will need to describe the “types of data… to be produced in the course of the project”, “the standards to be used for data and metadata format”, “policies for access and sharing”, “policies and provisions for re-use, re-distribution, and the production of derivatives”, and “plans for archiving data… and for preservation of access.” Proposals cannot be submitted without such a plan.

As another important consideration, if “any PI or co-PI identified on the project has received NSF funding (including any current funding) in the past five years”, the proposal must include a description of past awards, including a synopsis of data produced from these awards. Specifically, in addition to a basic summary of past projects, this description should include “evidence of research products and their availability, including, but not limited to: data, publications, samples, physical collections, software, and models, as described in any Data Management Plan.”

Along these same lines, NSF also recently adjusted the requirements for the Biographical Sketch to specify “Products” rather than just “Publications.” Thus, in addition to previous items in this category, such as publications and patents, “Products” now also includes data.

The overall implication is that NSF is interested in seeing both past success in impacting the community through data sharing and specific plans for how this will be accomplished in future research. Be sure to keep this in mind when writing your next proposal. And remember… data is powerful.

For more information on NSF proposal guidelines: http://www.nsf.gov/bfa/dias/policy/

HySpeed Computing – Reviewing our progress and looking ahead

Join HySpeed Computing as we highlight our accomplishments from the past year and look ahead to what is sure to be a productive 2013.

The past year has been an eventful period in the life of HySpeed Computing. This was the year we introduced ourselves to the world, launching our website (www.hyspeedcomputing.com) and engaging the community through social media platforms (the usual suspects: Facebook, LinkedIn and Google+). If you’re reading this, you’ve found our blog, and we thank you for your interest. We’ve covered a variety of topics to date, from community data sharing and building an innovation community to Earth remote sensing and high performance computing. As our journey continues we will keep sharing our insights, and we welcome you to participate in the conversation.

August of 2012 marked the completion of work on our grant from the National Science Foundation (NSF). The project, funded through the NSF SBIR/STTR and ERC Collaboration Opportunity, was a partnership between HySpeed Computing and the Bernard M. Gordon Center for Subsurface Sensing and Imaging Systems at Northeastern University. Through this work we were able to successfully utilize GPU computing to accelerate a remote sensing tool for the analysis of submerged marine environments. Our accelerated version of the algorithm was 45x faster than the original, thus approaching the capacity for real-time processing of this complex algorithm.

HySpeed Computing president, Dr. James Goodman, also attended a number of professional conferences and meetings during 2012. This included showcasing our achievements in geospatial applications and community data sharing at the International Coral Reef Symposium in Cairns, Australia and the NASA HyspIRI Science Workshop in Washington, D.C., and presenting our accomplishments in remote sensing algorithm acceleration at the GPU Technology Conference in Pasadena, CA and the VISualize Conference in Washington, D.C. Along the way we met, and learned from, a wonderfully diverse group of scientists and professionals. We are encouraged by the direction and dedication we see in the community and honored to be a contributor to this progress.

So what are we looking forward to in 2013? You heard it here first – we are proud to soon be launching HyPhoon, a gateway for accessing and sharing both datasets and applications. The initial HyPhoon release will focus on providing the community with free and open access to remote sensing datasets. We already have data from the University of Queensland, Rochester Institute of Technology, University of Puerto Rico at Mayaguez, NASA, and the Galileo Group, with additional commitments from others. This data will be available for the community to use in research projects, class assignments, algorithm development, application testing and validation, and in some cases also commercial applications. In other words, in the spirit of encouraging innovation, these datasets are offered as a community resource and open to your creativity. We look forward to seeing what you accomplish.

Connect with us through our website or via social media to pre-register and be among the first to access the data as soon as it becomes available!

Beyond datasets, HyPhoon will also soon include a marketplace for community members to access advanced algorithms and sell user-created applications. Are you a scientist with an innovative new algorithm? Are you a developer who can help transform research code into user applications? Are you working in the application domain and have ideas for algorithms that would benefit your work? Are you looking to reach a larger audience and expand your impact on the community? If so, we encourage you to get involved in our community.

HySpeed Computing is all about accelerating innovation and technology transfer.

The NEON Science Mission – Open access ecological data

Interested in assessing the ecological impacts of climate change? How about investigating the complex dynamics of ecological response to land use change and invasive species? What types of data would you need to perform such research at regional and continental scales? These are just some of the ambitious science questions being addressed by NEON – the National Ecological Observatory Network.

Sponsored by the U.S. National Science Foundation, NEON is an integrated network of 60 sites located throughout the U.S. where infrastructure is being put in place to collect a uniform array of scientific data. The hypothesis is that by providing consistent measurements and observations across the U.S., scientists will be better able to answer critical questions related to environmental change. Originally conceived in 1997, and followed by many years of planning, NEON entered its construction phase in 2012. Current plans are for the network to be fully operational in 2017, and for data from NEON to be collected for 30 years.

The 60 NEON sites encompass the continental U.S., Alaska, Hawaii and Puerto Rico. Sites were selected to represent a diverse range of vegetation communities, climate zones, land types, and land-use categories. The current list of NEON data products to be collected at each site includes over 500 different entries, spanning both field and remote sensing observations. Items range from as detailed as genetic sequences and isotope analyses of field samples to as broad as temperature and wind speed measurements from meteorological instruments. Additionally, in what has become a welcome trend within the community, NEON data is being distributed under an open access policy.

Of particular interest to the remote sensing community is that NEON includes an Airborne Observation Platform (AOP) that will be used to collect digital photography, imaging spectroscopy data, and full-waveform LiDAR data. To accommodate the geographic distribution of NEON sites, this same suite of remote sensing instrumentation will be deployed on three different aircraft. Note that remote sensing data collection, as well as testing and validation of analysis protocols, has already begun and preliminary data is available upon request.

Given its scope, it is clear that the data and information derived from the NEON project will have a profound impact on our understanding of the natural environment and our ability to assess ecological change.

For more information on NEON: http://www.neoninc.org/

The German EnMAP Satellite – Open data access for the community

Currently orbiting the Earth is an international collection of satellite instruments, both government and commercial, designed for measuring and observing our planet. The applications are as varied as the number of satellites, and then some, with new capabilities being developed every day. Soon to be included in this impressive mix of technology is EnMAP (Environmental Mapping and Analysis Program) – a new hyperspectral satellite from the German Aerospace Center (DLR) scheduled for launch in 2015.

EnMAP builds on decades of successful research by remote sensing scientists around the world in the field of hyperspectral imaging, also known as imaging spectroscopy or imaging spectrometry. Unlike traditional multispectral sensors, which measure select subsets of the electromagnetic spectrum, hyperspectral sensors provide contiguous measurements across the entire spectrum. This typically equates to measuring 100-200 or more bands, rather than the 4-20 bands measured by most multispectral systems. As a result, hyperspectral imaging allows scientists to analyze not just individual bands, or combinations of bands, but a multi-band profile of the full spectrum. This is equivalent to analyzing a curve rather than just points on a curve, thereby providing significantly more data with which to derive information about our planet.
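
As a rough illustration of the curve-versus-points distinction, consider the following Python sketch (hypothetical spectra and arbitrary band choices, not EnMAP data): a band ratio uses only two points on the spectrum, whereas a full-spectrum measure such as the spectral angle compares the entire curve against a reference.

    import numpy as np

    # Hypothetical example: a measured pixel spectrum and a reference
    # spectrum, each with 242 bands (the number EnMAP will provide).
    measured = np.random.rand(242)
    reference = np.random.rand(242)

    # Multispectral-style analysis: a ratio of two individual bands.
    band_ratio = measured[50] / measured[30]

    # Hyperspectral-style analysis: the spectral angle between the full
    # measured curve and the reference curve (smaller angle = closer match).
    cos_angle = np.dot(measured, reference) / (
        np.linalg.norm(measured) * np.linalg.norm(reference))
    spectral_angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))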

The EnMAP sensor will measure a total of 242 bands from 420-2450nm, at a spatial resolution (ground sampling distance) of 30m. The instrument will have a pointing range of +/- 30°, allowing greater flexibility with respect to the location on the Earth’s surface that can be measured during any particular orbit. The sensor’s swath width on the ground will be 30km, and the system will be capable of measuring 1000km of data per orbit and 5000km of data per day.
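
As a rough back-of-the-envelope calculation based on these stated specifications, a 30km swath combined with 5000km of along-track data per day corresponds to roughly 30km × 5000km = 150,000 square kilometers of potential coverage per day.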

Of great significance to the scientific community is that the EnMAP mission is embracing an open data policy, which means imagery will be made freely available for scientific use. The same policy is also being used for the currently available EnMAP simulated data, which allows researchers to develop and test EnMAP applications prior to instrument launch. Another notable aspect of the mission is the creation of EnMAP-Box, a platform-independent software tool being developed specifically for processing EnMAP imagery. The software will allow users to easily visualize and process EnMAP imagery using a suite of standard algorithms, as well as incorporate new user-contributed custom processing modules. It is admirable to see this level of community involvement embedded throughout the EnMAP mission.

The 2015 launch date might seem like it’s far away, but from the perspective of a satellite mission the date is just around the corner. Go EnMAP!

For more details on the EnMAP mission: http://www.enmap.org/

HyspIRI Science Workshop Day 3 – Community data and international collaboration

The final day of the HyspIRI Science Workshop emphasized international collaborations and the development of shared data resources for the remote sensing community. Vibrant conversations were heard around the meeting throughout the day, covering an array of topics but mostly focusing on how remote sensing can be used to assist in addressing key societal questions, such as climate and environmental change.

In addition to ongoing presentations related to the NASA HyspIRI mission, colleagues from other countries described international efforts to develop satellite instruments using similar technologies. For example, DLR, the German Aerospace Center, reported great progress with EnMAP (Environmental Mapping and Analysis Program). An exciting aspect of the EnMAP mission is that agreements have recently been established to make data from the mission freely available to interested researchers. Advances are also being made with HISUI (Hyperspectral Imager Suite), which is being developed by the Japanese Ministry of Economy, Trade and Industry, and with PRISMA (PRecursore IperSpettrale della Missione Applicativa), which is a combined imaging spectrometer and panchromatic camera system under development by the Italian Space Agency.

Liane Guild (NASA ARC) discusses NASA’s COAST project with Sergio Cerdeira Estrada (CONABIO), Frank Muller-Karger (USF) and Ira Leifer (UCSB)

But it wasn’t all about satellites. Significant attention was also placed on the various airborne missions being used to demonstrate technology readiness, as well as perform their own valuable scientific investigations. This includes instruments such as AVIRIS, AVIRIS-ng, HyTES, PHyTIR, PRISM and APEX. The research being conducted using these instruments, which include both imaging spectrometers and multispectral thermal systems, is vital for validating engineering design components, data delivery mechanisms, calibration procedures, and image analysis algorithms. As a result, these instruments represent important steps forward in the progress of the HyspIRI mission. However, they also independently have great value, providing numerous opportunities for remote sensing scientists to develop new methods and deliver innovative research results.

In addition to the instruments themselves, scientists are also working towards improving overall data availability, calibration techniques and field validation methods. For example, NASA JPL is enlisting the remote sensing community to build an open-access spectral library, with the impressive goal of cataloging the spectral characteristics of as many of the Earth’s natural and manmade materials as possible. Such spectra represent important components in a variety of image classification and analysis algorithms. Other programs, such as the NEON project in the U.S. and the TERN project in Australia, are focused on collecting field data from example study sites and providing the data for others to use in their own research projects. It’s encouraging to see this level of community and collaboration.

As evidenced by the presentations and posters at the workshop, imaging spectrometry is a mature science with a wealth of proven application areas. However, this won’t stop scientists from continuing to innovate and push the limits of what can be achieved using this technology. There’s always a new idea around the next corner, and it’s workshops like this that help promote information exchange, development of new collaborations, and the creation of new research directions.

Presentations from the HyspIRI Science Workshop and information on the HyspIRI mission can be found at http://hyspiri.jpl.nasa.gov/

Algorithm Validation – Using datasets and challenges to assess performance

Validation is an important component of algorithm development: it is the process by which developers confirm that a given algorithm meets acceptable levels of accuracy and performance. Effective validation requires a dataset with known input and output parameters, whereby algorithm outputs can be directly compared against the established output values.
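
As a minimal sketch of this comparison (assuming hypothetical known and algorithm-derived values; the metrics and threshold shown are common choices, not prescribed ones), the evaluation might look like the following in Python:

    import numpy as np

    # Hypothetical validation data: known output values and the
    # corresponding values produced by the algorithm under test.
    known = np.array([1.2, 3.4, 2.2, 5.0, 4.1])
    predicted = np.array([1.0, 3.6, 2.5, 4.8, 4.3])

    # Common summary metrics: mean bias and root-mean-square error.
    bias = np.mean(predicted - known)
    rmse = np.sqrt(np.mean((predicted - known) ** 2))

    # Accept the algorithm if it meets a predefined accuracy threshold.
    acceptable = rmse < 0.5
    print(f"bias={bias:.3f}, rmse={rmse:.3f}, acceptable={acceptable}")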

What types of datasets are commonly used in the remote sensing validation process? There are a few different options that can be considered. One is to generate your own dataset, collecting relevant field data in conjunction with an image acquisition campaign. Another approach is to build a synthetic dataset using computer modeling techniques or carefully controlled laboratory methods. And yet a third option is to employ an independent dataset with its own well-defined data parameters. While each of these options is available for assessing algorithm performance, effective validation data is typically difficult to find or create.

With that in mind, a topic related to validation, and an example of the independent dataset option, is the concept of algorithm challenges. The objective of an algorithm challenge is to utilize a common set of data and/or specifications as the basis for answering a particular problem. Participants in the challenge work independently or in teams to develop the best solution to that problem, where the top contributors usually receive at least recognition for their accomplishment and in some situations are also awarded a payment or prize. As an example, the Netflix Prize was a challenge focused on developing an improved algorithm for predicting user ratings on films. On a far grander scale, the X Prize Foundation is using the challenge format to address large societal issues, such as in healthcare, genomics and the environment.

In remote sensing, an excellent example of an algorithm challenge is the Target Detection Blind Test run by the Digital Imaging and Remote Sensing (DIRS) laboratory at the Rochester Institute of Technology. In this challenge, participants are first provided with a dataset for algorithm development and testing, which includes high-resolution hyperspectral imagery, a spectral library of targets to be identified, and the exact locations of those targets in the imagery. A second dataset is then provided, which includes just the imagery and spectral library. The target locations are not specified. Instead, participants must employ their algorithm to generate estimates of the target locations. Results are uploaded to the DIRS website and the estimates are evaluated for accuracy in terms of both correctly identifying target locations and minimizing the number of false identifications. Submitted results are then ranked according to overall effectiveness.
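
To make the scoring concrete, the Python sketch below shows one simple way such results might be evaluated (hypothetical coordinates and tolerance; this is not the actual DIRS scoring procedure): each true target counts as detected if an estimate falls within a distance tolerance of it, and each estimate not near any true target counts as a false identification.

    import numpy as np

    # Hypothetical pixel coordinates: true target locations and the
    # locations estimated by a participant's detection algorithm.
    true_targets = np.array([[120, 45], [300, 210], [415, 88]])
    estimates = np.array([[121, 44], [299, 211], [50, 400], [416, 90]])

    tolerance = 3.0  # maximum pixel distance for a correct detection

    # Pairwise distances: rows are estimates, columns are true targets.
    dists = np.linalg.norm(
        estimates[:, None, :] - true_targets[None, :, :], axis=2)

    detected = dists.min(axis=0) <= tolerance  # per true target
    false_ids = dists.min(axis=1) > tolerance  # per estimate

    print(f"detection rate: {detected.mean():.2f}, "
          f"false identifications: {false_ids.sum()}")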

Challenges also impart advantages such as encouraging innovation through competition and mobilizing the community to solve complex problems. Thus, the next time you develop a set of validation data, consider the benefits of transforming this data into a challenge. The resulting impact on the greater community, and the innovation these datasets can spark, can be immense.

The Benefits of Sharing Data – Why it’s worth the effort

HySpeed Computing explores the concepts and ideas behind community data sharing.

Sharing data with the greater community is not without effort. The data must be organized and thoroughly documented, a repository for hosting the data must be identified and maintained, and a chain of custody needs to be established to answer questions that may arise and to ensure data longevity. So you may ask: why make the effort? What are the benefits? Here we list some of the top advantages of community data sharing.

Expanded impact. Sharing data can increase the number of citations for publications related to the data. In research, particularly academic research, publications are the primary currency for disseminating knowledge, establishing expertise in a given field of study, advancing career status, and obtaining grant funding. In some scientific disciplines it is common practice to publish the data in conjunction with the research methods and results; however, in most cases this is not the norm. Sharing data, either as an addendum to a publication or in a separate repository, provides more resources than the publication alone, leading to both a greater impact on the community and improved citation returns.

Education. Sharing data provides a valuable resource for educating others. When learning something new there is nothing like hands-on experience to help assimilate the knowledge; hence the reason most classes, seminars and training sessions involve a project or exercise. However, most of us can recall a situation where the hardest aspect of an assignment seemed to be actually finding the data. Although this is a valuable lesson in itself, since data doesn’t always exist just because we think it should, it also indicates that not enough quality data is readily available for educational purposes. Sharing data resources is thus an important component of improving opportunities for education.

Innovation. Sharing data is the foundation for continued innovation. Once collected or created for a given project, or series of projects, data has served an important role and fulfilled its initially conceived use. Beyond this there is almost certainly potential for new ideas and analysis methods to be developed based on this same data. These ideas can then spark new collaborations and projects, which can lead to yet more advances. Sharing data thereby represents a building block for enabling further innovation.

Archiving. Sharing data establishes a legacy for its continued utility. In some situations, such as many federally funded grants in the U.S., it is a requirement that researchers develop and execute a data management plan, which includes long-term plans for data storage and availability. Sharing data can thus be an important component of meeting grant requirements. Sharing also serves a role for establishing data longevity, providing a valuable resource for future research.

Technology in the Tropics – An overview of South Florida tech resources

There is a growing technology community emerging in South Florida. Developers and entrepreneurs in the area represent a diverse array of talent and exhibit an infectious innovative spirit. There’s also the added benefit of the tropical lifestyle and international culture that defines South Florida. Below are some resources to get you started and get you connected to the tech vibe and professional networks in the tropics.

Tech & Professional Groups/Meetings

Co-working Spaces

Incubators, Accelerators and Tech Research Parks

Investors

South Florida Resources

Florida Statewide Resources

Florida International University (FIU)

University of Miami (UM)

Florida Atlantic University (FAU)

Nova Southeastern University (NSU)

Chambers of Commerce

This list was last updated 12.21.2012. Suggestions are welcome for additional resources to add to the list.

Community Data – Not all data is equal

HySpeed Computing explores the concepts and ideas behind community data sharing.

A common theme heard throughout the scientific community is the need for more open and more effective data sharing mechanisms. However, not all data is created equal, nor should there be a single methodology or pathway for distributing data.

So what are the differences in data? Data can be categorized as a function of its origin and intended use, and each category carries correspondingly different considerations for sharing. Below are examples of four main categories of data types.

Application Data. This category includes data that is routinely utilized for fulfilling the implementation needs of one or more applications. For example, satellite imagery from the suite of Landsat sensors provides a multidisciplinary resource for a broad array of earth observing applications, e.g., forestry, agriculture, coastal, and urban monitoring projects. Such data is typically housed in large repositories that offer users access to the data, but with little additional information beyond descriptors of the data characteristics and application domains.

Development Data. This is data utilized to develop new algorithms and analysis techniques. For example, data collected from a new instrument, such as the next generation AVIRIS sensor, is provided to the science community to test sensor performance and explore new analysis capabilities. These types of data are usually offered in smaller data repositories, sometimes with more restricted access, but typically include additional supporting documentation beyond just the data characteristics, such as sensor design information, science discussions and research results.

Validation Data. This category refers to data used for an existing research discovery and offered as a resource for others to validate the same findings. For example, satellite data documenting the declining ice coverage in the arctic regions is made available for multiple research groups to independently assess and validate conclusions on global change. As with the development category, such data is typically offered in smaller data repositories, which in addition to the data characteristics contain summaries of existing research methods and results that have already been obtained using the data.

Private Data. This category contains data with personal or confidential information that, if released, could cause harm or damage. For example, imagery of military facilities can contain details that are inappropriate for public release. In some cases such data can be openly distributed, if it is deemed no longer sensitive or if safeguards are in place to restrict distribution or conceal particular elements, but in most cases such data is appropriately kept confidential.

Note that data categories are not exclusive of one another. A given dataset can easily fall into more than one category. The important point is to recognize the particular characteristics of the data and share it appropriately and as openly as possible.

HySpeed Computing will continue to explore different aspects of community data sharing in future posts, and will soon also be releasing its own data access portal.

HySpeed Computing donates to Chesapeake Bay Foundation

On behalf of the speakers at VISualize 2012, a remote sensing conference hosted by Exelis Visual Information Solutions from June 18-19 at the World Wildlife Fund offices in Washington, D.C., HySpeed Computing is pleased to make a donation to the Chesapeake Bay Foundation.

The theme of VISualize 2012 was Climate Change and Environmental Monitoring. The event brought together thought leaders from non-profit, government, academic and industry organizations to share their vision for using remote sensing technology as an effective tool in environmental management applications. In the spirit of this theme, HySpeed Computing is proud to contribute to the efforts of the Chesapeake Bay Foundation and wishes them the best in their ongoing conservation activities.