Pacific Northwest National Laboratory, 902 Battelle Boulevard, P.O. Box 999 MSIN K7-28, Richland, WA 99352, United States
Exelis Visual Information Solutions, Inc., 4990 Pearl East Circle, Boulder, CO 80301, United States
Article info
Abstract
Article history:
Received 24 August 2013
Received in revised form
2 June 2014
Accepted 7 June 2014
Available online 9 July 2014
The Atmospheric Radiation Measurement (ARM) Data Integrator (ADI) is a framework designed to streamline the development of scientific algorithms that analyze, and models that use, time-series NetCDF data. ADI automates the process of retrieving and preparing data for analysis, provides a modular, flexible framework that simplifies software development, and supports a data integration workflow. Algorithm and model input data, preprocessing, and output data specifications are defined through a graphical interface. ADI includes a library of software modules to support the workflow, and a source code generator that produces C, IDL, and Python templates to jump-start development. While developed for processing climate data, ADI can be applied to any time-series data. This paper discusses the ADI framework and how ADI's capabilities can decrease the time and cost of implementing scientific algorithms, allowing modelers and scientists to focus their efforts on their research rather than on preparing and packaging data.
© 2014 Elsevier Ltd. All rights reserved.
Keywords:
Atmospheric science
Time-series
NetCDF
Scientific data analysis
Observation data
Scientific workflow
Data management
1. Introduction
Since 1992, the U.S. Department of Energy's Atmospheric Radiation Measurement (ARM) program (Stokes and Schwartz, 1994) has been collecting, processing into Network Common Data Form (NetCDF) format, and distributing data from highly instrumented ground stations. The instrumentation is positioned across the globe in both permanent and mobile facilities. The program maintains a production data processing center that ingests the data collected from its instruments and creates data of higher quality and greater scientific relevance to its goal of using the program's data to evaluate and improve global climate models (GCMs). The higher-level data products, referred to as Value Added Products (VAPs), are created by applying increasingly advanced analysis techniques to existing data products. Examples include precipitable water vapor and liquid water path retrievals that improve the modeling of the diabatic feedback from clouds in GCMs by improving the understanding of the impact of clouds on the radiative flux (Turner et al., 2007), and a closure experiment designed to analyze and improve ARM's Line-by-Line Radiative Transfer Model (LBLRTM) and the spectral line parameters it uses (Turner
(Eaton et al., 1997), S Plus, and its GNU version R (Team, 2005). Two widely used file-based operator libraries are the NetCDF Operators (NCO) (Zender, 2008) and the Climate Data Operators (CDO) (https://code.zmaw.de/projects/cdo). NCO is a set of low-level functions that supports the analysis of self-describing gridded geoscience data. Many of the NCO and CDO functions are similar to operations performed through the ADI GUI (such as reading, writing, interpolating, and averaging). But these operators, along with functionality not supported by ADI, could be accessed from within the ADI framework as a useful alternative and supplement to the methods ADI already supports. Several existing frameworks use NCO, such as the Script Workflow Analysis for Multi-Processing (SWAMP) system, a framework that uses shell-script interfaces to optimize data analysis by running at data centers, thus avoiding the overhead associated with moving large data sets (Wang et al., 2009).
An evaluation of existing tools and approaches was conducted in the spring of 2009 to determine whether an existing system could be used or leveraged to meet the architectural, standardized software development environment, and data retrieval, transformation, and creation requirements that ARM had determined were necessary to achieve the desired savings in algorithm development time and cost. Not surprisingly, no single system was found that met the program's needs. Existing solutions tended to focus either on providing a flexible architecture for integrating workflow components and data analysis tools, or on providing low-level tools that access, manipulate, or create data. ADI falls between these two paradigms in that it did not need most of the capabilities provided by available architectures. The low-level data retrieval, manipulation, and transformation libraries best suited to ARM's requirements were designed either to work with NetCDF file inputs and outputs (which makes them available to diverse users and systems, but does not allow them to be efficiently invoked from within an algorithm) or to run within environments well suited for design but not for production data processing; because these libraries were generalized to all gridded data (e.g. MATLAB and S Plus), applying them to time-series data required some additional effort.
Scientific workflow applications, such as Eclipse and Kepler (https://kepler-project.org), appeared particularly well suited for the design and prototyping of the algorithms ARM required. Recognizing this, the initial ADI prototype was implemented as a plug-in for the Eclipse integrated development environment and extensible plug-in system (http://www.eclipse.org/). However, testing in the prototype stage revealed that users were not making use of the platform. The workflow components needed to automate pre- and post-data processing and to jump-start algorithm development are static and follow a known order, and the platform's unneeded advanced features, such as data visualization, increased complexity and decreased efficiency more than the benefits gained. As a result, the Eclipse environment was discontinued, and scripts are now used to execute the standard workflow to minimize complexity and improve processing efficiency. This does not preclude the workflow from being ported to a more flexible framework; the scripts can be replaced by calling the ADI components from a system that supports user-defined workflows.
ADI users can incorporate I/O-based operator libraries such as CDO and NCO, which provide mathematical, statistical, and data transformation capabilities, into their processing through functions provided to create intermediate output files within the workflow. These intermediate files can be used as input to the operator library functions, and the results pulled back into the ADI process. Functionality provided by the supported languages is also readily available. Users who prefer working within programming environments such as MATLAB can use the Data
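The intermediate-file pattern described above can be sketched as follows. The file names are placeholders standing in for ADI intermediate output, and the wrapper functions are invented for illustration; `ncra`, however, is a real NCO operator that averages records across input NetCDF files.

```python
import shutil
import subprocess

def build_nco_average_command(input_files, output_file):
    """Build an 'ncra' (NCO record averager) command line.

    The input files would be intermediate NetCDF files written from
    within the ADI workflow; the averaged result could then be read
    back into the ADI process.
    """
    return ["ncra"] + list(input_files) + [output_file]

def run_nco_average(input_files, output_file):
    """Run ncra if it is installed; return True on success, False if absent."""
    if shutil.which("ncra") is None:
        return False  # NCO is not available on this system
    subprocess.run(build_nco_average_command(input_files, output_file),
                   check=True)
    return True
```

The same pattern applies to any CDO or NCO operator that reads and writes NetCDF files.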
Table 1
Comparison of ADI to alternative framework architectures: Fmet, HSRL, Giovanni, ESG, CDAT, SWAMP, REAP, and ADI.
3.5. SASL
Simple Authentication and Security Layer (SASL), version 2.1.23, is used for the MD5 functions. ADI will be updated to use OpenSSL in the future.
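For reference, an equivalent MD5 digest in Python comes from the standard library's hashlib, which is itself backed by OpenSSL, in line with ADI's planned migration. This is a generic illustration, not ADI's internal API:

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string, e.g. for checksumming
    a data file. hashlib delegates to OpenSSL's implementation."""
    return hashlib.md5(data).hexdigest()
```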
3.6. Apache Flex
The browser-based Process Configuration Manager (PCM) graphical interface was developed using the Apache Flex [4] (Apache Software Foundation, 2014) software development kit, but only the Adobe Flash Player is needed to run it.
3.7. Cheetah
To jump-start user algorithm development, the ADI template source code generator uses the open-source template engine and code generation tool Cheetah (Rudd et al., 2001) to generate the C, IDL, and Python data integration algorithms with user hooks.
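The flavor of template-based code generation can be conveyed with the standard library's string.Template, which, like Cheetah, uses $-prefixed placeholders. The generated stub below is invented for illustration; it is not ADI's actual template output:

```python
from string import Template

# A miniature stand-in for a code-generation template. Substituting
# values into the placeholders yields a source stub with a user hook.
stub_template = Template(
    "def ${hook_name}_hook(data):\n"
    "    # TODO: user code for process '${process}'\n"
    "    return data\n"
)

stub = stub_template.substitute(hook_name="post_retrieval",
                                process="adi_example1")
```

Cheetah extends this idea with control flow, includes, and compiled templates, which is what makes it suitable for emitting full C, IDL, and Python algorithm skeletons.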
3.8. System architecture
The ADI libraries and applications have been compiled and tested for the Red Hat Enterprise Linux 2.5 and 2.6 operating systems, but they should be able to run under any non-proprietary operating system (e.g. GNU/Linux, BSD) that supports the previously discussed libraries and packages. The beta version of the libraries was developed and tested for the Fedora 13 [5] operating system.
3.9. IDL
To develop algorithms in IDL, version 8.2 is required for 64-bit support.
3.10. Logs and metrics-related components
ADI's data system database (DSDB) is implemented using the PostgreSQL 8.4 open-source object-relational database system (PostgreSQL, 2014). ADI can be configured to build without it if only the Web Service backend is needed.
ADI has a built-in logging and messaging system. Most functions that ADI performs automatically (i.e. all the green boxes in Fig. 1, in the web version) are logged, and unusual circumstances such as missing files or system issues are identified. ADI also provides standard tools for developers to log and classify their messages as regular logs, warnings, or errors. In this way every ADI process has a standardized log, which allows for the development of automated methods for tracking provenance, benchmarking, or data mining to track when and how often specific events occur.
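The three message classes can be illustrated with Python's standard logging module; the real ADI logging API and its function names differ from this sketch:

```python
import logging

# A logger standing in for an ADI process's standardized log.
log = logging.getLogger("adi_process")
log.setLevel(logging.INFO)

def log_message(level: str, text: str) -> None:
    """Route a message to one of the three classes ADI distinguishes:
    regular logs, warnings, or errors. Raises KeyError for other levels."""
    {"log": log.info, "warning": log.warning, "error": log.error}[level](text)
```

Because every message carries a severity class, downstream tools can scan the logs uniformly, which is what enables the automated provenance and benchmarking methods mentioned above.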
4. Results
Table 2
Comparison of ADI to alternative data processing libraries and tools.

                                             IDL  Python  MatLab,  S Plus,  CMOR  NCO  CDO  ADI
                                                          Octave   R
command line interface                        Y     Y       Y        Y       N     Y    Y    N
scriptable                                    Y     Y       Y        Y       Y     Y    Y    Y
reads NetCDF                                  Y     Y       Y        Y       N     Y    Y    Y
writes NetCDF                                 Y     Y       Y        Y       Y     Y    Y    Y
transformation functions                      N     N       N        N       N     Y    Y    Y
integrate data through GUI                    N     N       N        N       N     N    N    Y
provides mathematical, statistical functions  Y     Y       Y        Y       N     Y    Y    N
The Atmospheric Radiation Measurement (ARM) Data Integrator (ADI) is a suite of tools, libraries, data structures, and interfaces that simplify the development of scientific algorithms that process time-series data. It minimizes the amount of source code needed to implement an algorithm by automating data retrieval and the creation of output data products, allowing development efforts to be focused on the science. Built-in integration capabilities include merging data, applying data type and unit conversions, and coordinate dimension transformations. ADI's web-based graphical interface, the Process Configuration Manager (PCM), is used to
[4] Apache Flex and Apache are registered trademarks of The Apache Software Foundation.
[5] Fedora is a trademark of the Fedora Project.
Fig. 1. ARM ADI framework, Data System DataBase (DSDB) and related components.
custom logic. The hooks, designed to allow users to access any point
in the framework, are represented in Fig. 2 as blue circles. The
remainder of this section describes the components of the ADI
framework in more detail.
4.1. Configuring a process
The Process Configuration Manager (PCM) is the graphical user interface through which processes and datastreams are defined, viewed, and edited. This interface simplifies the development of new ARM algorithms and data products by providing access to existing ARM datastreams and variables for use in defining the input and output of a new process.
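A process definition of this kind might be represented, purely for illustration, as a small record; the field names and the output datastream name "adimisc.c1" are hypothetical, while the two input datastreams are the publicly available ones used by the adi_example1 process shown in Fig. 3:

```python
from dataclasses import dataclass

@dataclass
class ProcessDefinition:
    """Hypothetical sketch of what a PCM process definition captures."""
    name: str
    input_datastreams: list          # datastreams to retrieve data from
    output_datastreams: list         # data products to create
    processing_interval: str = "1 day"  # size of data chunks per iteration

example = ProcessDefinition(
    name="adi_example1",
    input_datastreams=["aosmet.a1", "aoscpcf.a1"],
    output_datastreams=["adimisc.c1"],
)
```

In ADI itself these definitions are entered through the PCM and stored in the DSDB rather than written by hand.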
Defining a process includes specifying the input data that needs to be retrieved, the output datastream(s) that will be created (Figure S1), the mapping of input data that is passed through to the output, process specifications such as the size of the data chunks sent through the data integration workflow, and the size of the data products produced. ADI can also transform all input data onto a
Fig. 3. Retrieval Editor main form view of process adi_example1. Variables are retrieved from two datastreams, aosmet.a1 and aoscpcf.a1, both of which are publicly available from the ARM data archive.
data retrieval structure. After the process has been initialized, the execution of the modules that are repeated for each processing interval begins. Each of these modules requires several input arguments, including the start and end dates for the process interval, the most relevant internal data structure, and the user's data structure.
The Retrieve Data core module transforms the units and data types of the input data to those specified in the graphical interface before populating the retrieved data structure. If more than one input data file exists within a processing interval, the data is stored in the retrieved data structure in individual objects for each input data file. Unless the user indicates otherwise via a flag, the Merge Observation module consolidates these multiple individual observations into a single object that spans the entire processing interval. The Transform Data module then maps the retrieved data to the coordinate system grid specified in the Retrieval Editor PCM form, storing the results in the transformed data structure. How the transformation is executed is a function of the transformation parameters that have been set for that coordinate system. In addition to the transformation parameters defined in the graphical interface, additional parameters can be defined either in configuration files or through supporting functions. A detailed description of the functionality and capabilities of the Transform Data module is beyond the scope of this paper; the parameters are presented in ADI's online documentation at https://engineering.arm.gov/ADI_doc/ (Gaustad et al., 2014a).
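One of the simplest transformations of this kind is mapping samples onto a target coordinate grid. The sketch below shows plain linear interpolation onto a new time grid; it is only a conceptual stand-in, since the actual Transform Data module also handles averaging, QC flags, and per-coordinate-system transformation parameters:

```python
def interpolate_to_grid(times, values, target_times):
    """Linearly interpolate (times, values) samples onto target_times.

    Assumes 'times' is sorted and each target falls within its range;
    a stand-in for one of the transformations a coordinate system
    grid mapping can apply.
    """
    out = []
    for t in target_times:
        # Find the bracketing interval [times[i], times[i + 1]].
        i = 0
        while i + 2 < len(times) and times[i + 1] < t:
            i += 1
        t0, t1 = times[i], times[i + 1]
        v0, v1 = values[i], values[i + 1]
        w = (t - t0) / (t1 - t0)
        out.append(v0 + w * (v1 - v0))
    return out
```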
Next, the Create Output Datasets module creates an output data structure for each output data product and populates the output variables that were mapped from retrieved variables in the Retrieval Editor form. Output variables not mapped to a retrieved variable are created and assigned fill values. The last core module invoked for each process interval, the Store Data module, creates the output NetCDF file(s) for the current processing interval. Process control is then returned to the pre-retrieval hook for the next iteration.
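The per-interval module sequence described above can be sketched as a simple driver loop; the module names mirror the text, but the function signatures are illustrative, not ADI's C/IDL/Python API:

```python
# Order of the core modules executed once per processing interval.
CORE_MODULE_ORDER = ["retrieve_data", "merge_observations",
                     "transform_data", "create_output_datasets",
                     "store_data"]

def run_processing_intervals(intervals, modules):
    """Call each core module once per processing interval, in order.

    'modules' maps each name in CORE_MODULE_ORDER to a callable taking
    (interval, data); 'data' stands in for the shared data structures
    passed between modules. Returns the call trace for inspection.
    """
    trace = []
    for interval in intervals:
        data = {"interval": interval}
        for name in CORE_MODULE_ORDER:
            modules[name](interval, data)
            trace.append(name)
    return trace
```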
Table 3
User hooks that developers can access from their algorithms: Initialize, Pre-retrieval, Post-retrieval, Pre-transformation, Post-transformation, Process data, and Finish.
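How user code might be wired into these hooks can be illustrated with a minimal registry; the actual ADI bindings register hooks differently, and the names below simply echo the hooks listed in Table 3:

```python
# Hooks mirroring Table 3; each surrounds a step of the core workflow.
HOOK_NAMES = ("initialize", "pre_retrieval", "post_retrieval",
              "pre_transformation", "post_transformation",
              "process_data", "finish")

class HookRegistry:
    """Illustrative registry mapping hook names to user callables."""

    def __init__(self):
        self._hooks = {}

    def register(self, name, func):
        if name not in HOOK_NAMES:
            raise ValueError("unknown hook: " + name)
        self._hooks[name] = func

    def call(self, name, *args):
        """Invoke the user's hook if one was registered, else do nothing."""
        func = self._hooks.get(name)
        return func(*args) if func else None
```

The framework would invoke registry.call("post_retrieval", data) and its siblings at the matching points in the per-interval loop, so an algorithm only supplies code for the hooks it needs.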