
Environmental Modelling & Software 60 (2014) 241–249


A scientific data processing framework for time series NetCDF data


Krista Gaustad a,*, Tim Shippert a, Brian Ermold a, Sherman Beus a, Jeff Daily a, Atle Borsholm b, Kevin Fox a

a Pacific Northwest National Laboratory, 902 Battelle Boulevard, P.O. Box 999 MSIN K7-28, Richland, WA 99352, United States
b Exelis Visual Information Solutions, Inc., 4990 Pearl East Circle, Boulder, CO 80301, United States

* Corresponding author. Tel.: +1 509 375 5950. E-mail address: krista.gaustad@pnnl.gov (K. Gaustad).

Article info

Article history: Received 24 August 2013; received in revised form 2 June 2014; accepted 7 June 2014; available online 9 July 2014.

Keywords: Atmospheric science; Time-series; NetCDF; Scientific data analysis; Observation data; Scientific workflow; Data management

Abstract

The Atmospheric Radiation Measurement (ARM) Data Integrator (ADI) is a framework designed to streamline the development of scientific algorithms that analyze, and models that use, time-series NetCDF data. ADI automates the process of retrieving and preparing data for analysis, provides a modular, flexible framework that simplifies software development, and supports a data integration workflow. Algorithm and model input data, preprocessing, and output data specifications are defined through a graphical interface. ADI includes a library of software modules to support the workflow, and a source code generator that produces C, IDL, and Python templates to jump start development. While developed for processing climate data, ADI can be applied to any time-series data. This paper discusses the ADI framework, and how ADI's capabilities can decrease the time and cost of implementing scientific algorithms, allowing modelers and scientists to focus their efforts on their research rather than on preparing and packaging data.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction
Since 1992, the U.S. Department of Energy's Atmospheric Radiation Measurement (ARM) program (Stokes and Schwartz, 1994) has been collecting, processing into Network Common Data Form (NetCDF) format, and distributing data from highly instrumented ground stations. The instrumentation is positioned across the globe in both permanent and mobile facilities. The program maintains a production data processing center that ingests the data collected from its instruments and creates data of higher quality and greater scientific relevance to its goal of using the program's data to evaluate and improve global climate models (GCMs). The higher level data products, referred to as Value Added Products (VAPs), are created by applying increasingly advanced analysis techniques to existing data products. Examples include precipitable water vapor and liquid water path retrievals that improve the modeling of the diabatic feedback from clouds in GCMs by improving the understanding of the impact of clouds on the radiative flux (Turner et al., 2007), and a closure experiment designed to analyze and improve ARM's Line-by-Line Radiative Transfer Model (LBLRTM) and the spectral line parameters it uses (Turner et al., 2004). The latter has the potential to significantly improve GCM performance by contributing to small improvements in the accuracy of the radiative transfer models used by GCMs (Ellingson and Wiscombe, 1996). ARM has also developed a Climate Modeling Best Estimate (CMBE) data set for use by global climate modelers (Xie et al., 2010). While these examples were implemented prior to ADI's release, they exemplify the intent and nature of the algorithms that are currently being, and will be, developed within the ADI framework. This paper describes how ADI simplifies the access to, manipulation of, and generation of time-series data products, and how ADI can be used to expedite the development and analysis of robust, flexible, scientific algorithms and atmospheric process models.
As previously described, continued improvement of climate and atmospheric process model performance through the analysis of ARM's cloud and radiation observations requires the implementation of increasingly complex routines that examine larger and more diverse datasets. VAPs recently released and those currently under development typically access hundreds of variables from many data sources, each with their own distinct coordinate system grids. To work efficiently with these heterogeneous input data sets, and to reduce the time spent managing complex input data, ARM developed the ADI framework and associated development environment. ADI automates, to the extent possible, the integration of diverse input data into common formats, streamlines the creation of standard, well-documented output data products, and decreases the time needed to import a scientist's prototype algorithm into ARM's production data processing environment. It supports automation through the use of a graphical interface that stores retrieval process definitions, allowing them to be shared with others. Based on porting existing algorithms to ADI, ARM has noted a decrease in the time needed to perform typical pre- and post-data preparation tasks from about two days down to a few hours for the simplest case of a single input and output data product. For more complex algorithms, the time needed to implement the data retrieval, integration, and creation of output data products decreased from several weeks to at most a few days. Perhaps more importantly, confidence in the quality and consistency of the preprocessing has increased because of the use of a standard set of functions and libraries.
While ADI was developed to support the incorporation of well-established algorithms into a production data processing center, its preprocessing capabilities can also benefit scientists and modelers implementing models or testing algorithms for non-processing-system applications. It does this by facilitating the comparison, parameterization, and evaluation of model data. Scientific research frequently makes use of measurement data to develop and evaluate theories and to illustrate key findings. Before scientists can perform analysis on data collected from instruments or on products derived from instrument data, they must first not only obtain the input data, but also frequently alter its format to meet their specific needs. The modeling community also experiences difficulties in working with non-standard programming interfaces, resource-consuming input data preparation and preprocessing, and the integration of data in diverse file formats and coordinate grid shapes. The use of multi-component, self-describing frameworks with common interfaces has demonstrated advantages in terms of metadata, provenance, error handling, and reproducibility that improve the performance of, and the user's interaction with, a workflow (Turuncoglu et al., 2013). ADI can be viewed as a type of integrated environmental modeling (IEM) framework in that it supports the implementation of models as individual components, moves information between process and model components, captures provenance, supports data transformations to prepare data for the model(s), integrates standards and error handling, supports multiple programming languages, and allows modelers to retain ownership of their process, allowing them to share and exchange models and processes (Whelan et al., 2014).
Recognizing that many scientists will not want to work within the ADI development environment, but may still want to make use of its automated data preparation and production capabilities, a command line application of the ADI workflow has been implemented. This is referred to as the Data Consolidator application. Based on the information provided via the graphical interface, the Data Consolidator application executes the data retrieval, preprocessing, and data production workflow, providing a dataset in the format and with the content needed for a user's application or model. Thus, ADI can decrease the time and expense associated with the non-scientific tasks necessary to perform scientific analysis, whether in a production data processing environment or to meet individual pre- and post-data-processing needs.
2. Comparison with existing tools
This section describes existing frameworks, data transformation models, and supporting tools that perform functions similar to those of the ADI framework and its libraries, summarizes key characteristics and capabilities that were considered in ADI's design, and compares ADI with other platforms and tools in the context of these qualities. It also describes ADI's design decisions in terms of alternative architectures, programming, and data manipulation tools.
The need for a data integration platform to access, integrate, analyze, and share data from diverse datasets is a problem not only for the atmospheric science community (Woolf et al., 2005), but also for many other scientific disciplines such as the biological sciences (Nelson et al., 2011) and hydrology (Ames et al., 2012). Across most disciplines, including the atmospheric sciences, the typical solution is a framework through which tools specific to each data processing step can be integrated and applied as needed. Many framework solutions propose designs but have not been implemented (Woolf et al., 2005), or have been prototyped but have not been widely used (Cheng et al., 2009). Frameworks designed specifically for data transformations often focus on meeting the general transformation needs of data archives and warehouses, such as providing transformations that support changes in data formats (Abiteboul et al., 1999), the adoption of newer systems that use different data models, or the cleaning and consolidation of data (Claypool and Rundensteiner, 1999).
Fully operational atmospheric science frameworks have been developed, some very similar to ADI, such as the High Spectral Resolution Lidar (Eloranta, 2005), whose data download functionality also features a graphical interface (http://hsrl.ssec.wisc.edu/) that allows users to select transformation parameters to be applied to the lidar data prior to delivery. However, it only provides transformation functionality for data from a single instrument. Giovanni (Berrick et al., 2008) is a workflow that provides a set of data analysis recipes through which the user can manipulate data, along with powerful tools for visualizing the data. It is also similar to ADI in that it uses a GUI (Graphical User Interface) to simplify its use, but it does not allow users to implement their own data recipes. The Earth System Grid (ESG) (Bernholdt et al., 2005) is a project intended to address distributed and heterogeneous climate dataset management, discovery, access, and analysis, with an emphasis on data delivery, but its current focus is on climate simulation datasets (although it has expressed a long-term goal of supporting observation data). Like ARM, ESG expects the data it distributes to adhere to its own specified set of standards. While ESG intends to eventually support data concatenation and sub-setting, its primary mechanism for providing data operations is to supply users with tools developed by others. The Climate Data Analysis Tools (CDAT) is ESG's data analysis engine (Williams et al., 2009). CDAT is described as a framework that consolidates access to a disparate set of software tools for the discovery, analysis, and intercomparison of coupled multi-model climate data. As a framework solution, CDAT includes low-level data toolkits such as the Climate Model Output Rewrite, or CMOR (Doutriaux and Taylor, 2011). CMOR is similar to ADI in that it is a software library used to produce CF (Climate and Forecast)-compliant NetCDF (Davis et al., 2014) files, has a built-in checker to confirm adherence to CF standards, uses UDUNITS (Unidata, 2011), and allows users to provide data of any type, unit, and dimension order. However, it does not appear to have been developed as an all-purpose data writer, as its design is geared toward preparing and managing Model Intercomparison Project (MIP) output (i.e., the climate community's standard model experiments), and it only automatically converts to the units and types expected by MIP models.
Scientific and gridded data transformation models are available in numerical programming environments and as libraries of command-line, file-I/O-based operators. Numerical programming environments that support data transformations include MATLAB (a registered trademark of The MathWorks, Inc.; http://www.mathworks.com), its GNU counterpart Octave (Eaton et al., 1997), S Plus, and its GNU counterpart R (Team, 2005). Two widely used file-based operator libraries are the NetCDF Operators (NCO) (Zender, 2008) and the Climate Data Operators (CDO) (https://code.zmaw.de/projects/cdo). NCO is a widely used set of low-level functions that supports the analysis of self-describing gridded geoscience data. Many of the NCO and CDO functions are similar to operations performed through the ADI GUI (such as reading, writing, interpolating, and averaging). These operators, along with functionality not supported by ADI, could be accessed from within the ADI framework as a useful alternative and supplement to the methods already supported by ADI. Several existing frameworks use NCO, such as the Script Workflow Analysis for Multi-Processing (SWAMP) system, a framework that uses shell-script interfaces to optimize data analysis by running at data centers, thus avoiding the overhead associated with moving around big data sets (Wang et al., 2009).
An evaluation of existing tools and approaches was conducted in the spring of 2009 to determine whether an existing system could be used or leveraged to meet the architectural, standardized software development environment, data retrieval, transformation, and data creation requirements that ARM had determined were necessary to achieve the desired savings in algorithm development time and cost. Not surprisingly, no single system was found that met the program's needs. Existing solutions tended to focus either on providing a flexible architecture to integrate workflow components and data analysis tools, or on providing low-level tools that access, manipulate, or create data. ADI falls between these two paradigms, in that it did not need most of the capabilities provided by available architectures. The low-level data retrieval, manipulation, and transformation libraries best suited to ARM's requirements were designed either to work with NetCDF file inputs and outputs (which makes them available to diverse users and systems, but does not allow them to be invoked efficiently from within an algorithm) or to operate within environments well suited for design. These libraries were not well suited for production data processing and were generalized to all gridded systems (e.g., MATLAB and S Plus), requiring some additional effort to apply them to time-series data.
Scientific workflow applications, such as Eclipse and Kepler (https://kepler-project.org), appeared particularly well suited to the design and prototyping of the algorithms ARM required. Recognizing this, the initial ADI prototype was implemented as a plug-in for the Eclipse integrated development environment and extensible plug-in system (http://www.eclipse.org/). However, testing in the prototype stage revealed that users were not making use of the platform. The workflow components needed to automate pre- and post-data processing and jump start algorithm development are static and follow a known order, and the platform's unneeded advanced features, such as data visualization, increased complexity and decreased efficiency more than the benefits gained. As a result, the Eclipse environment was discontinued, and scripts are now used to execute the standard workflow to minimize complexity and improve processing efficiency. This does not preclude the workflow from being ported to a more flexible framework, and the scripts can be replaced by calling the ADI components from a system that supports user-defined workflows.
ADI users can incorporate I/O-based operator libraries such as CDO and NCO, which provide mathematical, statistical, and data transformation capabilities, into their processing through functions provided to create intermediate output files within the workflow. These intermediate files can be used as input to the operator library functions and the results pulled back into the ADI process. Functionality provided by the supported languages is also readily available. Users who prefer working within programming environments such as MATLAB can use the Data Consolidator application as a pre-processing tool to create input data products suitable for subsequent use within these environments.
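To make the round trip concrete, the following is a minimal Python sketch (not ADI's actual API) of how an intermediate NetCDF file written by a workflow could be passed through an external command-line operator and the result read back; the file names, variable names, and the choice of CDO operator are illustrative assumptions.

import subprocess
import netCDF4

# The intermediate file is assumed to have been written earlier in the workflow,
# and the operator shown here (CDO's daily-mean operator) is only one example
# of an external file-I/O-based tool.
intermediate_file = "adi_intermediate.nc"            # hypothetical name
reduced_file = "adi_intermediate_daymean.nc"

# Run the external operator on the intermediate file.
subprocess.run(["cdo", "daymean", intermediate_file, reduced_file], check=True)

# Pull the operator's results back into the Python process.
with netCDF4.Dataset(reduced_file) as nc:
    time = nc.variables["time"][:]                   # assumes a 'time' coordinate variable
    temperature = nc.variables["temp"][:]            # hypothetical variable name
    print(temperature.mean())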
ADI stores data retrieved from input files in data structures accessible to its modules and to users via data access functions. All of the retrieved data is stored in memory. Large files can currently be handled by reading in only the variables needed and by limiting the number of records read in each pass through the ADI workflow via the appropriate processing parameters. A process interval parameter controls the amount of data pushed through the workflow, and a split interval sets the maximum file size of the output data products. Data is processed through the workflow and incrementally written to the output file until the split interval size is reached; after that point, subsequent data is written to a new file. This method of handling large files is only helpful for processing that can be performed in sequential chunks. ADI has been implemented on the Olympus computer supported by the Pacific Northwest National Laboratory (PNNL) Institutional Computing (PIC) program (Pacific Northwest National Laboratory, 2014) for use with computationally intensive models and algorithms. In addition, the data assimilation and transformation methods of ADI could fit into a MapReduce framework to process scalable datasets, and in the future ARM plans to implement multiprocessor and distributed processing methods natively within ADI.
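The interplay of the process interval and split interval can be pictured with a short, self-contained Python sketch; the dates, interval sizes, and print statements below are illustrative stand-ins, not ADI parameters or code.

from datetime import datetime, timedelta

# Hypothetical processing period and interval sizes; in ADI these would come
# from the PCM process definition rather than hard-coded values.
begin = datetime(2014, 1, 1)
end = datetime(2014, 1, 8)
process_interval = timedelta(days=1)   # data pushed through the workflow per pass
split_interval = timedelta(days=3)     # maximum time span per output file

current = begin
file_start = begin
while current < end:
    chunk_end = min(current + process_interval, end)

    # Placeholder for the retrieve -> merge -> transform -> create -> store
    # modules that ADI runs once per process interval.
    print(f"processing {current} to {chunk_end}")

    # Start a new output file whenever the split interval has been filled.
    if chunk_end - file_start >= split_interval:
        print(f"closing output file spanning {file_start} to {chunk_end}")
        file_start = chunk_end

    current = chunk_end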
ADI's framework and low-level data I/O, NetCDF access, and manipulation functions were implemented in C because of its processing efficiency and because many higher-level languages interact well with C. To support development in other languages, with the same look and feel developers expect in those languages, bindings to the C-level functions must be implemented in each supported language. The script that defines the workflow and the templates used to jump start algorithm development must also be implemented in each language. Python (a trademark of the Python Software Foundation; Van Rossum and Drake, 2003) and IDL (Interactive Data Language, a registered trademark of Exelis Inc.; http://www.exelisvis.com/idl) were selected as the initial supported programming languages because of the wide use of IDL in the atmospheric science community and the extensive and continually growing set of scientific modules supported in Python. While ARM does not currently have plans to support ADI in any other development languages, support could be extended to other scientific languages such as MATLAB and R. MATLAB was evaluated as a candidate development language, but the features that make it valuable for scientists in designing their algorithms also make it significantly slower than its alternatives. The GNU Data Language (GDL) was evaluated for use by the ARM infrastructure in 2011 (Coulais et al., 2011) and was not considered stable and complete enough to meet the infrastructure's needs at that time. The program will revisit this decision in the future and could extend ADI support to include GDL if such a review shows the language has evolved sufficiently to meet the program's needs.
Table 1 notes the key characteristics and capabilities that were considered in the design of ADI and the other available scientific workflow architectures evaluated. The frameworks examined and the references used to determine whether a capability was met include: the Framework for Meteosat data processing, FMet (Cermak et al., 2007), the High Spectral Resolution Lidar (Eloranta, 2005), Giovanni (Berrick et al., 2008), the Earth System Grid, ESG (Bernholdt et al., 2005), the Climate Data Analysis Tools, CDAT (Williams et al., 2009), the Script Workflow Analysis for Multi-Processing, SWAMP (Wang et al., 2009), and the Real-time Environment for Analytical Processing, REAP (Barseghian et al., 2010).

Table 1
Comparison of ADI to alternative framework architectures.

Capability | FMet | HSRL | Giovanni | ESG | CDAT | SWAMP | REAP | ADI
Software freely available | Y | web app | web app | web app | Y | Y | N | Y
Runs on non-proprietary operating system | Y | N/A | N/A | N/A | Y | Y | Y | Y
User-defined workflows | Y | N | N | Y (via CDAT) | Y | Y | Y | —
User-defined transformations | N | Y (limited) | N | Y (via CDAT) | Y | Y (via NCO) | N | Y
Can run user algorithms | Y | N | N | Y (via CDAT) | Y | Y | Y | Y
Data visualization tools | Y | Y | Y | Y (via CDAT) | Y | N | N | N
Command line interface | N | N/A | N | Y (via CDAT) | Y | Y | N | Y
Configurable through GUI | Y | Y | Y | via CDAT | Y | N | N | Y
Languages supported | Fortran | N/A | N/A | Y (via CDAT) | Python, Java, C/C++, Fortran | NCO | N/A | C, IDL, Python
Facilitates algorithm development | N | N/A | N | Y (via CDAT) | Y | N | N | Y
Process large files | ? | N/A | N/A | Y (via CDAT) | Y | Y (via NCO) | ? | Y (limited)
Table 2 compares ADI's lower-level data manipulation capabilities with those available in other tools. While the capabilities appear very similar, analytical systems such as R and MATLAB, and libraries associated with programming languages, provide limited integrated support for accessing, analyzing, and creating scientific datasets. NCO, CDO, CMOR, and ADI provide higher level functions that users would otherwise have to write themselves using tools associated with particular programming languages.
3. Materials and methods

The following sections discuss the composition of ARM data and the software libraries, software packages, and databases used to support ADI.

3.1. ARM data

This initial implementation of ADI has been developed for use with NetCDF (Rew and Davis, 1990) data produced by the ARM program. ARM data is publicly available and can be accessed through the ARM Data Archive (http://www.archive.arm.gov), requiring registration only to allow the program to keep metrics on the data it delivers. However, ADI can be used to process any time-series NetCDF data that conforms to NetCDF standards. ARM instrument data is collected, converted from the instrument's native or raw data format to NetCDF, and then delivered to the ARM Data Archive for permanent storage. ARM creates higher-level data products by applying scientific algorithms to the ingested instrument data.

3.2. Unidata NetCDF and UDUNITS software packages

ADI's convenience functions use Unidata's NetCDF4 software libraries (Davis et al., 2014) to access input files and create output files in the NetCDF format, and the UDUNITS2 package (Unidata, 2011) to convert units.

3.3. PostgreSQL database

ADI's data system database (DSDB) is implemented using the PostgreSQL 8.4 open-source object-relational database system (PostgreSQL, 2014). ADI can be configured to build without it if only the Web Service backend is needed.

3.4. LIBCURL

The multiprotocol file transfer library libcurl (http://curl.haxx.se/libcurl/), version 7.19.7, is used to access the DSDB via a Web Service. ADI can be configured to build without it if only the PostgreSQL backend is needed.

3.5. SASL

The Simple Authentication and Security Layer (SASL) library, version 2.1.23, is used for the MD5 functions. ADI will be updated to use OpenSSL in the future.

3.6. Apache Flex

The browser-based Process Configuration Manager (PCM) graphical interface was developed using the Apache Flex (Apache Software Foundation, 2014) software development kit, but only the Adobe Flash Player is needed to run it.

3.7. Cheetah

To jump-start user algorithm development, the ADI template source code generator uses the open-source template engine and code generation tool Cheetah (Rudd et al., 2001) to generate the C, IDL, and Python data integration algorithms with user hooks.

3.8. System architecture

The ADI libraries and applications have been compiled and tested only for the Red Hat Enterprise Linux 2.5 and 2.6 operating systems, but they should be able to run under any non-proprietary operating system (e.g., GNU/Linux, BSD) that supports the previously discussed libraries and packages. The beta version of the libraries was developed and tested for the Fedora 13 operating system.

3.9. IDL

To develop algorithms in IDL, version 8.2 is needed for 64-bit support.

3.10. Logs and metrics-related components

ADI has a built-in logging and messaging system. Most functions that ADI performs automatically (i.e., all the green boxes (in the web version) in Fig. 1) will be logged, and unusual circumstances such as missing files or system issues will be identified. ADI also provides standard tools for developers to log and classify their messages as regular logs, warnings, or errors. In this way every ADI process has a standardized log, which allows for the development of automated methods for tracking provenance, benchmarking, or data mining to track when and how often specific events occur.

Table 2
Comparison of ADI to alternative data processing libraries and tools.

Capability | IDL | Python | MatLab, Octave | S Plus, R | CMOR | NCO | CDO | ADI
Command line interface | Y | Y | Y | Y | N | Y | Y | N
Scriptable | Y | Y | Y | Y | Y | Y | Y | Y
Reads NetCDF | Y | Y | Y | Y | N | Y | Y | Y
Writes NetCDF | Y | Y | Y | Y | Y | Y | Y | Y
Transformation functions | N | N | N | N | N | Y | Y | Y
Integrate data through GUI | N | N | N | N | N | N | N | Y
Provides mathematical, statistical functions | Y | Y | Y | Y | N | Y | Y | N

4. Results

The Atmospheric Radiation Measurement (ARM) Data Integrator (ADI) is a suite of tools, libraries, data structures, and interfaces that simplify the development of scientific algorithms that process time-series data. It minimizes the amount of source code needed to implement an algorithm by automating data retrieval and the creation of output data products, allowing development efforts to be focused on the science. Built-in integration capabilities include merging data, applying data type and unit conversions, and performing coordinate dimension transformations. ADI's web-based graphical interface, the Process Configuration Manager (PCM), is used to define the parameters that describe the data to be retrieved, how it should be integrated, and the content of the output data products. This information is stored in a central database and made available through web services to ADI's processing framework and other supporting applications. Fig. 1 shows the relationship of the graphical interface to the database, the ADI framework, and supporting applications.

Fig. 1. ARM ADI framework, Data System DataBase (DSDB) and related components.
Once a process is defined in the PCM, users can develop their own source code to run under the ADI framework. However, if a user simply needs to consolidate data from existing NetCDF files or transform data onto a new coordinate grid without any additional analysis, the Data Consolidator application can be used without the need to write any code.
The core modules executed by the workflow are shown in the green (in the web version) boxes in Fig. 2. A source code generation tool is provided to jump-start user algorithm development. It uses PCM retrieval process definitions to create an initial set of C, Python, or IDL software project files that execute the core ADI processing modules and provide hooks into which users can insert custom logic. The hooks, designed to allow users to access any point in the framework, are represented in Fig. 2 as blue circles. The remainder of this section describes the components of the ADI framework in more detail.
Fig. 2. ARM ADI workflow and framework hooks.

4.1. Configuring a process

The Process Configuration Manager (PCM) is the graphical user interface through which processes and datastreams are defined, viewed, and edited. This interface simplifies the development of new ARM algorithms and data products by providing access to existing ARM datastreams and variables for use in defining the input and output of a new process.

Defining a process includes specifying the input data that needs to be retrieved, the output datastream(s) that will be created (Figure S1), the mapping of input data that is passed through to the output, and process specifications such as the size of the data chunks sent through the data integration workflow and the size of the data products produced. ADI can also transform all input data onto a common grid, as well as automatically convert data types and units. The process definition interfaces and their interrelationships are illustrated in the online Supplementary materials (Figure S2). Each box represents a window that can be accessed through the PCM's graphical interface, and includes the name of the form displayed in the window and the process attributes that are collected from that form. The process attributes that are defined based on an input or output datastream are noted with blue dashed lines.
4.1.1. Specifying a retrieval

The PCM's Retrieval Editor form (Fig. 3) decreases the amount of code VAP developers have to write by allowing users to set criteria that the ADI workflow can use to automate the opening of source data files, the retrieval of variables, and the execution of coordinate system, unit, and data type transformations.

Fig. 3. Retrieval Editor Form main form view of process adi_example1. Variables are retrieved from two datastreams, aosmet.a1 and aoscpcf.a1, both of which are publicly available from the ARM Data Archive.

Users can set up a hierarchical list of input data sources (Figure S3), transform the retrieved data onto a new coordinate grid, and specify the variables to be passed through to output data products (Figure S4). If a supported transformation method is appropriate (interpolation, averaging, or nearest neighbor), the user does not need to supply any code, but must define the desired grid. If the grid is uniform (i.e., the spacing between coordinate values is constant) or is the same coordinate grid as that of a retrieved variable, the grid can be defined through the interface (Figure S5). To transform to an irregular grid, the user explicitly defines the elements of the grid through a function or via a flat file. No additional code is required. Selection of a preferred and alternative data source can be based on time (where the preferred input changes as improved data products are brought on line) and location (typically driven by different input sources being available at different locations). Examples of how data can be manipulated include variable name changes and transformations. A detailed list of retrieval criteria that can be set is available in the online Supplementary materials, Table S3.
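The kinds of coordinate transformations the supported methods perform can be pictured with a small, generic numpy sketch; the array values and variable names are hypothetical, and this is not ADI's API. It maps an irregularly sampled variable onto a uniform 30-s target grid by interpolation, nearest neighbor, and bin averaging.

import numpy as np

# Hypothetical source data: a variable sampled on an irregular time grid (seconds).
src_time = np.array([0.0, 19.0, 41.0, 60.0, 83.0, 101.0, 118.0])
src_temp = np.array([21.3, 21.4, 21.6, 21.5, 21.9, 22.1, 22.0])

# Uniform target grid: constant 30-s spacing, as could be defined through the PCM.
tgt_time = np.arange(0.0, 121.0, 30.0)

# Interpolation onto the target grid.
interp_temp = np.interp(tgt_time, src_time, src_temp)

# Nearest-neighbor mapping onto the target grid.
nearest_idx = np.abs(src_time[:, None] - tgt_time[None, :]).argmin(axis=0)
nearest_temp = src_temp[nearest_idx]

# Bin averaging: mean of all source samples falling within each target interval.
bins = np.digitize(src_time, tgt_time) - 1
avg_temp = np.array([src_temp[bins == i].mean() if np.any(bins == i) else np.nan
                     for i in range(len(tgt_time))])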
4.1.2. Datastream definition

In addition to fully defining a process, the other functionality provided by the PCM is to simplify the design and creation of output data products. New datastreams can be defined manually by entering attribute information by hand or by dragging and dropping attributes from output data products already defined in the PCM (Figure S6). Importing header information from an existing NetCDF file can also create a new datastream.

ARM, like most scientific data repositories, requires that submitted data conform to standards defined by the program. To facilitate user adherence to the standards, and the program's ability to confirm their adherence, the standard validation logic is embedded in the PCM interface. If a user's output file structure diverges from the expected standards, the violations are flagged (Figure S7).
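A minimal sketch of the kind of structural check such validation performs is shown below, using the netCDF4 Python library; the required-attribute list and file name are assumed examples and do not reproduce ARM's actual data product standards or the PCM's validation logic.

import netCDF4

# Illustrative check only: this attribute list is an assumed example.
REQUIRED_VAR_ATTRS = ["long_name", "units"]

def flag_standards_violations(path):
    """Return (variable, missing attributes) pairs for the assumed required attributes."""
    violations = []
    with netCDF4.Dataset(path) as nc:
        for name, var in nc.variables.items():
            missing = [a for a in REQUIRED_VAR_ATTRS if a not in var.ncattrs()]
            if missing:
                violations.append((name, missing))
    return violations

# Example usage on a hypothetical output file:
# for name, missing in flag_standards_violations("example_output.nc"):
#     print(f"{name}: missing {', '.join(missing)}")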
4.2. Data Consolidator tool

The Data Consolidator is a stand-alone command-line application that consolidates data from different input sources based on a defined retrieval process and produces the output data product(s) associated with the retrieval process without requiring the user to write any source code. If variables have been defined in an output data product that have not been associated with a variable retrieved or mapped from an input data source, the Data Consolidator tool will create these new variables using the metadata defined for the output data product and populate them with default fill values. This feature allows users who do not want to develop software within the ADI framework to use its data integration to prepare data files that can be used as input files to their algorithm or programming environment. Their algorithm simply needs to assign real values to the fill values and does not have to perform transformations or create variables or attributes.
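A minimal sketch, assuming hypothetical file and variable names, of how a downstream algorithm could open a Consolidator-produced file with the netCDF4 Python library and overwrite a placeholder variable's fill values with computed results:

import netCDF4

# The file name, variable names, and the computed quantity are illustrative only.
with netCDF4.Dataset("consolidated_output.nc", "r+") as nc:
    temp = nc.variables["temperature"][:]     # variable populated by the retrieval
    ratio = nc.variables["derived_ratio"]     # placeholder variable holding fill values

    # Compute the science result and overwrite the default fill values.
    result = temp / temp.max()
    ratio[:] = result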
4.3. Algorithm development framework and template generation tool

A primary design goal of ADI is to expedite the development of ARM algorithms that add value through the additional processing of existing ARM datastreams, and to provide atmospheric scientists with a development environment that will expedite scientific discovery. Algorithm development is supported for three programming languages: C, Python, and IDL. The steps to develop an ADI algorithm include:

- Define a process in the PCM documenting the process inputs, transformations, and outputs
- Run the Template Generator application to create an initial set of project files
- Implement logic specific to the algorithm through use of the appropriate user hook(s)
- Compile (if using the C language) and run the project
- Validate output, returning to earlier steps as necessary.
Following creation of a retrieval process, a source code Template Generator application is available to create an initial project in the desired programming language. The templates produced include a main routine; macros for the input, transformation, and output parameters; and function definitions for frequently used hooks. The project, with no additional logic added, can be compiled and run at the command line to produce output identical to that of the Data Consolidator run for the same retrieval process. A user's code is added using a two-step process: (1) update the main routine to call the necessary hook functions, and (2) add the desired logic to the body of the hook functions. A hook is available between each of ADI's main processing modules. The main routine of an ADI algorithm specifies what PCM process(es) can be executed, and which user hooks will be invoked during their processing. Example main routines implemented in C and IDL are illustrated in the online Supplementary materials, Figure S8 and Figure S9, respectively. The core modules, user hooks, and supporting functions are described in Section 4.4, ADI Processing Framework.
4.4. ADI processing framework

The ADI processing framework is composed of core modules, user hooks, and internal data structures. Data is passed between the core modules and hooks via internal data structures. Internal structures are used to store retrieved data, transformed data, and output data. A user data structure, whose content and structure are defined by the user, is provided to give developers a mechanism to share information across the core modules and user hooks.

The flow of data through the core modules and user hooks is illustrated in Fig. 2. The core modules (shown as green boxes in Fig. 2) execute the actions necessary to consolidate a group of diverse datasets into a single output data product. The user hooks (shown as blue circles) are the functions into which users can insert their own code to perform scientific analysis and any pre- and post-processing that is not supported by the core modules. The processing interval is the amount of data that will be pushed through the pipeline at one time. The retrieve, merge, transform, data creation, and storage core modules are invoked once for each process interval. The output interval is the maximum amount of data stored in the files generated by the process. The initialize and finish core modules are executed only once.

The following sections discuss the core modules, user hooks, and associated data structures that make up the components of the ADI data pipeline.
4.4.1. Core modules

The Initialize module is invoked first and sets the stage for the data processing. Tasks performed by the initialization module include capturing command line arguments, pulling process configuration information from the database, opening logs, initializing input and output datastreams, and building the internal data retrieval structure. After the process has been initialized, the execution of the modules that are repeated for each processing interval begins. Each of these modules requires several input arguments, including the start and end dates for each process interval, the most relevant internal data structure, and the user's data structure.

The Retrieve Data core module reads the input data sources and populates the retrieved data structure, converting units and data types to those specified in the graphical interface. If more than one input data file exists within a processing interval, the data is stored in the retrieved data structure in individual objects for each input data file. Unless the user indicates otherwise via a flag, the Merge Observation module consolidates multiple individual observations into a single object that spans the entire processing interval. The Transform Data module then maps the retrieved data to the coordinate system grid specified in the Retrieval Editor PCM form, storing the results in the transformed data structure. How the transformation is executed is a function of the transformation parameters that have been set for that coordinate system. In addition to the transformation parameters defined in the graphical interface, additional parameters can be defined either in configuration files or through supporting functions. A detailed description of the functionality and capabilities of the Transform Data module is beyond the scope of this paper; the parameters are presented in ADI's online documentation at https://engineering.arm.gov/ADI_doc/ (Gaustad et al., 2014a).

Next, the Create Output Datasets module creates an output data structure for each output data product and populates the output variables that were mapped from retrieved variables in the Retrieval Editor Form. Output variables not mapped to a retrieved variable are created and assigned fill values. The last core module invoked for each process interval, the Store Data module, creates the output NetCDF file(s) for the current processing interval. Process control is then returned to the pre-retrieval hook for the next iteration.

Once the Retrieve, Merge, Transform, Create, and Store output modules have been executed for each of the processing-interval-sized chunks of data that span the requested processing period, the Finish core module updates the database and logs with process status and metrics, closes logs, sends any email messages generated, and cleans up all allocated memory. Users can access, set, and change variable values in all of the internal data structures when and how they want via the provided user hooks.

Table 3
Hooks that developers can access from their algorithms.

Initialize
- Provides space for the user to perform initialization not supported by the ADI initialization module
- Instantiates the user data structure, which is subsequently passed to all downstream hooks and modules

Pre-retrieval
- First user function that will be executed for each pass over the individual process intervals that span the processing period
- Executes before data is retrieved from input data sources

Post-retrieval
- Allows users to perform actions on the retrieved data before the individual observations that comprise the current processing interval are merged

Pre-transformation
- Allows users an opportunity to access data after it has been merged (making it easier to traverse) but before any transformations are applied

Post-transformation
- Gives users access after the transformations have been applied

Process data
- Frequently used to implement the science algorithm because the pre-processing of input data has completed

Finish
- Executed after all intervals falling between the begin and end processing dates specified in the command line arguments have been completed, but prior to the execution of the Finish core module
- Used to free memory allocated for the user data structure or to perform whatever other cleanup tasks are needed
4.4.2. User hooks

To allow users the opportunity to implement any functionality needed, hooks are available before and after each of the core modules. The available hooks are summarized in Table 3. Hooks that execute prior to the Process Data module allow developers to convert data into whatever format is most appropriate for the desired transformations. For example, to find an average wind speed, it is necessary to average the orthogonal u and v wind components and convert the final result back to wind speed and direction. A user would accomplish this by using the Pre-transformation hook to go from speed and direction to u and v, and the Post-transformation hook to convert the results back to speed and direction. Plots of wind speed and direction on a data source's original 20-s time-coordinate grid and on a transformed one-minute grid, and a link to the source code used to execute the transformation, are available in the online Supplementary materials, Figure S10 and Figure S11.
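The wind example can be sketched in a few lines of generic numpy; the sample values are hypothetical and the averaging step is a simple stand-in for the transformation performed between the hooks, not ADI hook code.

import numpy as np

def speed_dir_to_uv(speed, direction_deg):
    """Convert wind speed/direction to orthogonal u and v components."""
    rad = np.deg2rad(direction_deg)
    u = -speed * np.sin(rad)   # meteorological convention: direction is "from"
    v = -speed * np.cos(rad)
    return u, v

def uv_to_speed_dir(u, v):
    """Convert u and v components back to wind speed and direction."""
    speed = np.hypot(u, v)
    direction_deg = np.rad2deg(np.arctan2(-u, -v)) % 360.0
    return speed, direction_deg

# Hypothetical 20-s samples spanning one minute.
speed = np.array([4.0, 5.0, 4.5])
direction = np.array([350.0, 10.0, 5.0])

# Pre-transformation step: speed/direction -> u, v.
u, v = speed_dir_to_uv(speed, direction)

# Stand-in for the averaging transformation onto a one-minute grid.
u_avg, v_avg = u.mean(), v.mean()

# Post-transformation step: averaged u, v -> speed/direction.
avg_speed, avg_direction = uv_to_speed_dir(u_avg, v_avg)
print(avg_speed, avg_direction)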
4.4.3. Supporting functions

The purpose of the supporting functions is to provide developers access to any information relating to the process and its input and output datasets that will help them perform their analysis. The functions are written in C, but bindings to IDL and Python have been created. When the IDL or Python bindings are used, the functions are exposed as objects with associated properties and routines, providing users with the programming environment they expect in those languages. The available functions and objects are described in the online documentation (https://engineering.arm.gov/ADI_doc/library.html).
5. Conclusions

The ARM Data Integrator (ADI) framework automates data retrieval, preparation, and creation, and provides a structured algorithm development environment that expedites the development of algorithms and improves their standardization. The Data Consolidator application allows users to execute the retrieval, transformation, and data creation specifications defined through a graphical interface in a single command. Non-ADI algorithms can use the Data Consolidator to create input files with the desired coordinate grid, units, variable names, metadata, and placeholder values for variables whose values will be calculated by the algorithm. This removes all data preparation and creation logic from the algorithms, allowing them to read only the variables they need to perform their calculations and then update the existing fill values with the results. The Data Consolidator application can also be used to create simplified inputs for more sophisticated programming environments such as MATLAB and R. It can create inputs to the statistical, mathematical, and scientific functionality supported by these and other programming languages and by libraries of command line operators such as the Climate Data Operators (CDO) and NetCDF Operators (NCO). Together, the ADI development environment and Data Consolidator provide a niche solution well suited to implementing robust, production-ready software. Their modular design, implementation in C, and bindings to Python allow them to use, and be used by, tools with more architectural flexibility and scientific libraries. The Atmospheric Radiation Measurement (ARM) program has reaped significant savings by decreasing the time needed to pre-process the various inputs into the format and coordinate system needed for its scientific algorithms. The effort required to create a complex algorithm with inputs from several datastreams on different grids has been reduced from weeks of development time to a few days. While not as easily quantified, an equally important benefit is the move away from individual developers implementing their own methods of conditioning input data toward a set of versatile, robust, and well-tested library routines. This transition has resulted in substantially fewer problems being found in ARM's evaluation data sets, and has decreased the time needed to integrate new algorithms into the ARM production data processing system.
6. Software availability

The PCM demonstration software can be accessed free of charge at https://engineering.arm.gov/pcm/Main.html (Gaustad et al., 2014b). This link is to the interface currently used by ARM infrastructure to manage ongoing processing. Without signing in, the read-only link serves to demonstrate the current capabilities as well as to display live data product and processing configurations. Instructions for downloading a RHEL6 build and the source code are available under a modified BSD license at https://github.com/ARM-DOE/ADI.
Acknowledgments

This research was supported by the Office of Biological and Environmental Research of the U.S. Department of Energy under Contract No. DE-AC05-76RL01830 as part of the Atmospheric Radiation Measurement Climate Research Facility.

This project took advantage of netCDF software developed by UCAR/Unidata (www.unidata.ucar.edu/software/netcdf/).
Appendix A. Supplementary data
Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.envsoft.2014.06.005.
References
Abiteboul, S., Cluet, S., Milo, T., Mogilevsky, P., Siméon, J., Zohar, S., 1999. Tools for data translation and integration. Bull. Tech. Comm. Data Eng. 22 (1), 3–8.
Ames, D.P., Horsburgh, J.S., Cao, Y., Kadles, J., Whiteaker, T., Valentine, D., 2012. HydroDesktop: web services-based software for hydrologic data discovery, download, visualization, and analysis. Environ. Model. Softw. 37, 146–156.
Apache Software Foundation, 2014. Apache Flex. The Apache Software Foundation (accessed 22.04.14). http://flex.apache.org.
Barseghian, D., Altintas, I., Jones, M., Crawl, D., Potter, N., Gallagher, J., Cornillon, P., Schildhauer, M., Borer, E., Seabloom, E., Hosseini, P., 2010. Workflows and extensions to the Kepler scientific workflow system to support environmental sensor data access and analysis. Ecol. Inform. 5 (1), 42–50.
Bernholdt, D., Bharathi, S., Brown, D., Chanchio, K., Chen, M., Chervenak, A., Cinquini, L., Drach, B., Foster, I., Fox, P., Garcia, J., Kesselman, C., Markel, R., Middleton, D., Nefedova, V., Pouchard, L., Shoshani, A., Sim, A., Strand, G., Williams, D., Williams, D., 2005. The earth system grid: supporting the next generation of climate modeling research. Proc. IEEE 93 (3), 485–495.
Berrick, S.W., Leptoukh, G., Farley, J.D., Rui, H., 2008. Giovanni: a web service workflow-based data visualization and analysis system. IEEE Trans. Geosci. Remote Sens. 47 (1), 106–113.
Cermak, J., Bendix, J., Dobbermann, M., 2008. FMet – an integrated framework for Meteosat data processing for operational scientific applications. Comput. Geosci. 34 (11), 1638–1644.
Cheng, J., Lin, X., Zhou, Y., Li, J., 2009. A web based workflow system for distributed atmospheric data processing. In: 2009 IEEE International Symposium on Parallel and Distributed Processing with Applications. IEEE Computer Society. http://dx.doi.org/10.1109/ISPA.2009.30.
Claypool, K., Rundensteiner, E., 1999. Flexible database transformations: the SERF approach. Bull. Tech. Comm. Data Eng. 22 (1), 19–24.
Coulais, A., Schellens, M., Gales, J., Arabas, S., Boquien, M., Chanial, P., Messmer, P., Fillmore, D., Poplawski, O., Maret, S., Marchal, G., Galmiche, N., Mermet, T., 2011. Status of GDL – GNU Data Language. arXiv preprint arXiv:1101.0679.
Davis, G., Rew, R., Hartnett, E., Caron, J., Heimbigner, D., Emmerson, S., Davies, H., Fisher, W., 2014. NetCDF. Unidata Program Center, Boulder, Colorado (accessed 22.04.14). http://www.unidata.ucar.edu/software/netcdf/.
Doutriaux, C., Taylor, K., 2011. Climate Model Output Rewriter (CMOR). Program for Climate Model Diagnosis and Intercomparison (PCMDI) (accessed 22.04.14). http://www2-pcmdi.llnl.gov/cmor/index1_html.
Eaton, J.W., Bateman, D., Hauberg, S., 1997. GNU Octave. Free Software Foundation.
Ellingson, R.G., Wiscombe, W.J., 1996. The spectral radiance experiment (SPECTRE): project description and sample results. Bull. Am. Meteorol. Soc. 77 (9), 1967–1985.
Eloranta, E.E., 2005. High Spectral Resolution Lidar. Springer, New York, pp. 143–163.
Gaustad, K.G., Shippert, T., Ermold, B., Beus, S., Daily, J., 2014a. ARM Data Integrator (ADI) Documentation. Atmospheric Radiation Measurement (ARM) Climate Research Facility (accessed 22.04.14). https://engineering.arm.gov/ADI_doc/.
Gaustad, K.G., Shippert, T., Ermold, B., Beus, S., Daily, J., Borsholm, A., Fox, K., 2014b. ARM Processing Configuration Manager (PCM) Application. Atmospheric Radiation Measurement (ARM) Climate Research Facility (accessed 22.04.14). https://engineering.arm.gov/pcm/Main.html.
Nelson, E.K., Piehler, B., Eckels, J., Rauch, A., Bellew, M., Hussey, P., Ramsay, S., Nathe, C., Lum, K., Krouse, K., Steams, D., Connolly, B., Skillman, T., Igra, M., 2011. LabKey Server: an open source platform for scientific data integration, analysis, and collaboration. BMC Bioinform. 12 (1), 71–93.
Pacific Northwest National Laboratory, 2014. Institutional Computing Program (accessed 22.04.14). http://pic.pnnl.gov/.
PostgreSQL, 2014. PostgreSQL. The PostgreSQL Global Development Group (accessed 22.04.14). http://www.postgresql.org/.
Rew, R.K., Davis, G.P., 1990. NetCDF: an interface for scientific data access. IEEE Comput. Graph. Appl. 10 (4), 76–82.
Rudd, T., Orr, M., Bicking, I., Esterbrook, C., 2001. Cheetah, the Python-Powered Template Engine. The Cheetah Development Team (accessed 22.04.14). http://www.cheetahtemplate.org.
Stokes, G.M., Schwartz, S.E., 1994. The atmospheric radiation measurement (ARM) program: programmatic background and design of the cloud and radiation test bed. Bull. Am. Meteorol. Soc. 75 (7), 1201–1221.
Team, R.C., 2005. R: a Language and Environment for Statistical Computing. R Foundation for Statistical Computing.
Turner, D.D., Tobin, D.C., Clough, S.A., Brown, P.D., Ellingson, R.G., Mlawer, E.J., Knuteson, R.O., Revercomb, H.E., Shippert, T.R., Smith, W.L., Shephard, M.W., 2004. The QME AERI LBLRTM: a closure experiment for downwelling high spectral resolution infrared radiance. J. Atmos. Sci. 61 (22).
Turner, D.D., Clough, S.A., Liljegren, J.C., Clothiaux, E.E., Cady-Pereira, K.E., Gaustad, K.L., 2007. Retrieving liquid water path and precipitable water vapor from the atmospheric radiation measurement (ARM) microwave radiometers. Geosci. Remote Sens. IEEE Trans. 45 (11), 3680–3690.
Turuncoglu, U.U., Dalfes, N., Murphy, S., DeLuca, C., 2013. Toward self-describing and workflow integrated earth system models: a coupled atmosphere-ocean modeling system application. Environ. Model. Softw. 39, 247–262.
Unidata, 2011. UDUNITS. University Corporation for Atmospheric Research (UCAR), Boulder, Colorado (accessed 22.04.14). http://www.unidata.ucar.edu/software/udunits.
Van Rossum, G., Drake, F.L., 2003. Python Language Reference Manual. Network Theory.
Wang, D.L., Zender, C.S., Jenks, S.F., 2009. Efficient clustered server-side data analysis workflows using SWAMP. Earth Sci. Inform. 2 (3), 141–155.
Whelan, G., Kim, K., Pelton, M.A., Castleton, K.J., Laniak, G.F., Wolfe, K., Parmar, R., Babendreier, J., Galvin, M., 2014. Design of a component-based integrated environmental modeling framework. Environ. Model. Softw. 55, 1–24.
Williams, D.N., Doutriaux, C.M., Drach, R.S., McCoy, R.B., 2009. The Flexible Climate Data Analysis Tools (CDAT) for multi-model climate simulation data. In: Data Mining Workshops 2009, ICDMW '09, IEEE International Conference on, pp. 254–261.
Woolf, A., Cramer, R., Gutierrez, M., Dam, K.K.V., Kondapalli, S., Latham, S., Lawrence, B., Lowry, R., O'Neill, K., 2005. Standards-based data interoperability in the climate sciences. Meteorological Applications 12 (1), 9–22.
Xie, S., Jensen, M., McCoy, R.B., Klein, S.A., Cederwall, R.T., Wiscombe, W.J., Clothiaux, E.E., Gaustad, K.L., Golaz, J.-C., Hall, S., Johnson, K.L., Lin, Y., Long, C.N., Mather, J.H., McCord, R.A., McFarlane, S.A., Palanisamy, G., Shi, Y., Turner, D.D., 2010. ARM climate modeling best estimate data. Bull. Am. Meteorol. Soc. 91 (1), 13–20.
Zender, C.S., 2008. Analysis of self-describing geoscience data with netCDF operators (NCO). Environ. Model. Softw., 1338–1342.
