Informatics for Materials Science and Engineering: Data-driven Discovery for Accelerated Experimentation and Application
Ebook · 937 pages · 30 hours


About this ebook

Materials informatics, a ‘hot topic’ area in materials science, aims to combine traditionally bio-led informatics with computational methodologies, supporting more efficient research by identifying strategies for time- and cost-effective analysis.

The discovery and maturation of new materials have been outpaced by the thicket of data created by new combinatorial and high-throughput analytical techniques. The elaboration of this "quantitative avalanche"—and the resulting complex, multi-factor analyses required to understand it—means that interest, investment, and research are revisiting informatics approaches as a solution.

This work, from Krishna Rajan, the leading expert on the informatics approach to materials, seeks to break down the barriers between data management, quality standards, data mining, exchange, storage, and analysis, as a means of accelerating scientific research in materials science.

This solutions-based reference synthesizes foundational physical, statistical, and mathematical content with emerging experimental and real-world applications, for interdisciplinary researchers and those new to the field.

  • Identifies and analyzes interdisciplinary strategies (including combinatorial and high-throughput approaches) that accelerate materials development cycle times and reduce associated costs
  • Shows how mathematical and computational analysis aids the formulation of new structure–property correlations among large, heterogeneous, and distributed data sets
  • Provides practical examples, computational tools, and software analysis that support the rapid identification of critical data and the analysis of theoretical needs for future problems
Language: English
Release date: Jul 10, 2013
ISBN: 9780123946140


    Book preview

    Informatics for Materials Science and Engineering - Krishna Rajan

    1

    Materials Informatics

    An Introduction

    Krishna Rajan,    Dept. of Materials Science & Eng. and Bioinformatics & Computational Biology Program – Iowa State University, Ames IA, USA

    1 The What and Why of Informatics

    The search for new or alternative materials or processing strategies, whether through experiment or simulation, has been a slow and arduous task, punctuated by infrequent and often unexpected discoveries. Each of these findings prompts a flurry of studies to better understand the underlying science governing the behavior of these materials. While informatics is well established in fields such as biology, drug discovery, astronomy, and the quantitative social sciences, materials informatics is still in its infancy. The few systematic efforts that have been made to analyze trends in data as a basis for predictions have in large part been inconclusive, due not least to the lack of large amounts of organized data and, even more importantly, to the challenge of sifting through them in a timely and efficient manner.

    When the huge combinatorial space of chemistries defined by even a small portion of the periodic table is taken into account, it is clear that searching for new materials with tailored properties is a prohibitive task. Hence the search for new materials for new applications is largely limited to educated guesses. The data that do exist are often limited to small regions of compositional space. Experimental data are dispersed in the literature, and computationally derived data are limited to the few systems for which reliable inputs for calculations exist. Even with recent advances in high-speed computing, there are limits to how the structure and properties of many new materials can be calculated. This poses both a challenge and an opportunity. The challenge is to deal with extremely large, disparate databases and large-scale computation. It is here that knowledge discovery in databases, or data mining—an interdisciplinary field merging ideas from statistics, machine learning, databases, and parallel and distributed computing—provides a unique tool to integrate scientific information and theory for materials discovery. The key challenge in data mining is the extraction of knowledge and insight from massive databases; it takes the form of discovering new patterns or building models from a given data set. The opportunity is to take advantage of recent advances in data mining and apply them to state-of-the-art computational and experimental approaches for materials discovery.
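    To make the scale of this combinatorial space concrete, a short back-of-envelope calculation (the element count and composition grid below are illustrative choices, not from the text):

```python
from math import comb

# Illustrative numbers only: even a modest palette of candidate elements
# yields thousands of ternary systems before composition fractions,
# processing routes, or crystal structures are considered.
n_elements = 30
ternary_systems = comb(n_elements, 3)
print(ternary_systems)  # 4060 distinct element triples

# A coarse 10% composition grid within each ternary multiplies the
# search space further (grid points with fractions a + b <= 100, step 10).
grid_points_per_ternary = sum(1 for a in range(0, 101, 10)
                                for b in range(0, 101 - a, 10))
print(ternary_systems * grid_points_per_ternary)  # 267960 candidate points
```

    Even this crude estimate, restricted to three-component systems and a 10% grid, produces hundreds of thousands of candidates; finer grids or more components quickly make exhaustive search prohibitive.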

    A complete materials informatics program will have an information technology (IT)-based component that is linked to classical materials science research strategies. The former includes a number of features that make informatics critical to materials research (Figure 1.1):

    • Data warehousing and data management: This involves a science-based selection and organization of data that is linked to a reliable data searching and management system.

    • Data mining: Providing an accelerated analysis of large multivariate correlations.

    • Scientific visualization: A key area of scientific research that allows high dimensional information to be assessed.

    • Cyber infrastructure: An information technology infrastructure that can accelerate the sharing of information, data, and most importantly knowledge discovery.
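    A minimal sketch of the data warehousing and management idea, using an in-memory SQLite table; the schema, materials, and property values are illustrative stand-ins, not drawn from any real materials database:

```python
import sqlite3

# Records carry provenance so searches can filter on data source and
# quality -- the "science-based selection and organization" above.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE measurements (
    material TEXT, property TEXT, value REAL, units TEXT, source TEXT)""")
rows = [
    ("Al2O3", "fracture_toughness", 3.5, "MPa*m^0.5", "experiment"),
    ("Al2O3", "youngs_modulus", 370.0, "GPa", "handbook"),
    ("SiC",   "fracture_toughness", 4.6, "MPa*m^0.5", "experiment"),
    ("SiC",   "youngs_modulus", 410.0, "GPa", "simulation"),
]
conn.executemany("INSERT INTO measurements VALUES (?,?,?,?,?)", rows)

# Reliable searching: all experimental toughness values, ranked.
tough = conn.execute(
    "SELECT material, value FROM measurements "
    "WHERE property='fracture_toughness' AND source='experiment' "
    "ORDER BY value DESC").fetchall()
print(tough)  # [('SiC', 4.6), ('Al2O3', 3.5)]
```

    The same provenance column would let a data-quality policy exclude, say, unvalidated simulation results from a screening query.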

    Figure 1.1 The role of materials informatics is pervasive across all aspects of materials science and engineering. The mathematical tools based on data mining provide the computational engine for integrating materials science information across length scales. Informatics provides an accelerated means of fusing data and recognizing structure–property relationships between disparate length and time scales in a rapid yet robust manner. (From Rajan, 2005.)

    2 Learning from Systems Biology: An Omics Approach to Materials Design

    The concept of complexity in biology, and how to assess the links between information at the molecular level and that at the level of the living organism (e.g. genomics, proteomics, etc.), is the foundation of systems biology. Systems biology provides an excellent paradigm for the materials scientist. Ultimately one would like to take an atoms-to-applications approach to materials design. How do we organize atoms and systematically build structural units at increasing length scales, up to the final engineering component or structure? At present we need to rely on extensive prior knowledge from experiments, computation, and ultimately even failure analysis to understand the complex network of interactions of materials behavior that governs the performance of an engineering system. The problem is that even with advanced experimental and computational tools, the rate of discovery is still slow, punctuated only by unexpected findings (e.g. superconducting ceramics, conducting polymers) that stimulate new areas of research and development. The iterative approach shown in Figure 1.2 is common to many fields as one tries to link observations with models. The challenge is to develop models that capture the system behavior by accounting for all the different levels of information that contribute to it.

    Figure 1.2 Logic of information flow and knowledge discovery in classical research methodology. The example provided here addresses the use of qualitative reasoning to simulate and identify metabolic pathways. (From King et al., 2005.)

    The goal of modern systems biology is to understand physiology and disease from the level of molecular pathways, regulatory networks, cells, tissues, organs and ultimately the whole organism (Butcher et al., 2004). As currently employed, the term systems biology encompasses many different approaches and models for probing and understanding biological complexity, and studies of many organisms from bacteria to man. A similar paradigm exists for materials, e.g. atoms to airplanes (Figure 1.3).

    Figure 1.3 Comparison of length scale challenges in designing drugs for the life sciences (a; from Butcher et al., 2004) and designing materials for the engineering sciences (b; from Noor et al., 2000). Note the overlap in the length scales that govern engineering design and the human body.

    As aptly described by Butcher et al., the omics (the bottom-up approach) focuses on the identification and global measurement of molecular components. Modeling (the top-down approach) attempts to form integrative (across scales) models of human physiology and disease although, with current technologies, such modeling focuses on relatively specific questions at particular scales, e.g. at the pathway or organ levels. An intermediate approach, with the potential to bridge the two, is to generate profiling data from high-throughput experiments designed to incorporate biological complexity at multiple levels: multiple interacting active pathways, multiple intercommunicating cell types, and multiple different environments.

    A similar challenge occurs in materials science, identifying pathways of how chemistry, crystal structure, microstructure, processing variables, and component design and manufacturing communicate with each other to ultimately define performance. This forms the materials science equivalent of the biological regulatory network (Figure 1.4).

    Figure 1.4 An example of a regulatory network linking diverse sets of information from both theory and experiment in the study of materials degradation due to irradiation. (From Wirth et al., 2001.)

    Because biological complexity is an exponential function of the number of system components and the interactions between them, and escalates at each additional level of organization (Figure 1.5), such efforts are currently limited to simple organisms or to specific minimal pathways (and generally in very specific cell and environmental contexts) in higher organisms. The same can be said of complexity in materials science.

    Figure 1.5 Identification of regulatory pathways using network graphing and simulated annealing methods. (This figure has been reprinted from Ideker et al., 2002, by permission of Oxford University Press, and Aitchison and Galitski, 2003.)

    Even if our ability to measure molecules and their functional states and interactions were adequate to the task, computational limitations alone would prohibit our understanding of cell and tissue behavior at the molecular level. Thus, methodologies that filter information for relevance, such as biological context and experimental knowledge of cellular and higher level system responses, will be critical for successful understanding of different levels of organization in systems biology research.

    Informatics is the enabling tool to facilitate this process. For instance, Csete and Doyle (2002) have provided a very apt analogy between biological and engineering systems that helps to put materials informatics in perspective. A striking example of converging information across length scales is shown in Figure 1.6, which compares cruise speed to mass M over 12 orders of magnitude, from Boeing 747s and 777s down to fruit flies. This provides a good example of how, if one can integrate and identify key metrics (data), functional relations between variables can be developed across many scales. Here, a well-known elementary argument shows good correspondence with the data and yields explanations for deviations.

    Figure 1.6 A scaling diagram of the design behavior of aeronautical systems (both biological and artificial). (Adapted from Csete and Doyle, 2002.)

    Such theories are largely irrelevant to complexity directly, but an understanding of them leads to what is relevant. The scaling theory described by the above figure does not distinguish between flight in the atmosphere and in a laboratory wind tunnel. In the latter context, a much simpler mutant 777 with nearly all of its 150,000-count aeronome knocked out would have roughly the same lift, mass, and cruise speed, and thus (from an allometric scaling viewpoint) would exhibit no deleterious laboratory phenotype. Redundancy does not explain this finding. Rather, the mutant has lost the control systems and robustness required for real flight outside the lab. Allometric scaling emphasizes the essential similarities between these 777 variants and a toy scale model (and a fruit fly), whereas our interest is in their huge differences in complexity. Similarly, minimal cellular life requires a few hundred genes, yet even Escherichia coli has ~4000 genes, fewer than 300 of which have been classified as essential. The likely reason for this excess complexity is again the presence of complex regulatory networks for robustness. In technology as well as in organisms, such robustness tradeoffs drive the evolution of spiraling complexity.
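    The allometric scaling argument can be made concrete with a small sketch: assuming the classic U ∝ M^(1/6) flight-scaling exponent, the exponent is recovered from synthetic noisy (mass, speed) data by linear regression in log-log space. All numbers below are invented for illustration:

```python
import math
import random

# Synthetic data following speed ~ M^(1/6) over 12 orders of magnitude
# in mass, with multiplicative noise.
random.seed(0)
true_exponent = 1.0 / 6.0
masses = [10.0 ** k for k in range(-4, 8)]          # 12 orders of magnitude
speeds = [5.0 * m ** true_exponent * math.exp(random.gauss(0, 0.05))
          for m in masses]

# Linear least-squares fit in log-log space recovers the exponent.
xs = [math.log10(m) for m in masses]
ys = [math.log10(u) for u in speeds]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(round(slope, 3))   # close to 1/6 ~ 0.167
```

    Deviations of real vehicles or animals from the fitted line are exactly the residuals that, as the text notes, the scaling theory can then help explain.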

    3 Where Do We Get the Information?

    One may naturally assume that having large amounts of data is critical for any serious informatics study. What constitutes enough data in materials science applications, however, can vary significantly. In studying structural ceramics, for instance, fracture toughness measurements are difficult to make, and for some of the more complex materials just a few careful measurements can be of great value. Similarly, obtaining reliable measurements of fundamental constants or properties for a given material involves very detailed measurement and/or computational techniques. In essence, data sets in materials science fall into two broad categories. The first comprises data on a given material’s behavior, such as its mechanical or physical properties. The second comprises intrinsic information based on the chemical characteristics of the material, such as thermodynamic data sets.

    Crystallographic and thermochemical databases have historically been two of the best established in the materials science community. The former serve as the foundation for interpreting crystal structure data of metals, alloys, and inorganic materials. The latter involve the compilation of fundamental thermochemical information in terms of heat capacity and calorimetric data. While crystallographic databases are primarily used as a reference source, thermodynamic databases were actually one of the earliest examples of informatics, as they were integrated into thermochemical computations to map phase stability in binary and ternary alloys. This led to the development of computationally derived phase diagrams, a classic example of integrating information in databases into data models. The two kinds of database have evolved independently, although in terms of their scientific value they are extraordinarily intertwined. Phase diagrams map out regimes of crystal structure in temperature–composition or temperature–pressure space, yet crystal structure databases have been developed totally independently. At present the community has to work with each database separately; information searches are cumbersome, and data analysis and interpretation involving both are very difficult. Researchers integrate such information only for one very specific system at a time, based on their individual interests. Hence there is at present no unified way to explore patterns of behavior across both databases, even though they are scientifically related. Examples do exist in the biological and chemical sciences and provide useful templates for materials science (Figure 1.7).

    Figure 1.7 Example of integrating databases into knowledge discovery in drug discovery from information at the genomic level. (Adapted from Strausberg and Schreiber, 2003.)
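    A toy sketch of the kind of cross-database integration argued for here: two tiny dictionaries stand in for a crystallographic and a thermochemical database, merged on composition so that a single query can couple structure and thermochemistry. The handful of property values are standard handbook numbers, but the "databases" themselves are purely illustrative:

```python
# Hypothetical stand-ins for two independently developed databases.
crystal_db = {
    "NaCl": {"structure": "rocksalt", "a_angstrom": 5.64},
    "CsCl": {"structure": "cesium_chloride", "a_angstrom": 4.12},
    "MgO":  {"structure": "rocksalt", "a_angstrom": 4.21},
}
thermo_db = {
    "NaCl": {"delta_Hf_kJ_mol": -411.1},
    "MgO":  {"delta_Hf_kJ_mol": -601.6},
    # CsCl missing here: real databases overlap only partially.
}

# Merge on composition keys present in both sources.
merged = {k: {**crystal_db[k], **thermo_db[k]}
          for k in crystal_db.keys() & thermo_db.keys()}

# One query can now couple both, e.g. all rocksalt compounds with
# formation enthalpy below -500 kJ/mol:
hits = sorted(k for k, v in merged.items()
              if v["structure"] == "rocksalt"
              and v["delta_Hf_kJ_mol"] < -500)
print(hits)  # ['MgO']
```

    The incomplete overlap (CsCl drops out of the merged view) illustrates why ad hoc, per-system integration is the current norm and why a unified schema would be valuable.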

    One of the more systematic efforts to address this challenge has been that of Ashby (see, for example, Figure 1.8; Ashby, 2011), who showed how, by merging phenomenological relationships in materials properties with discrete data on specific material characteristics, one can begin to develop patterns of classification of materials behavior. The visualization of multivariate data was managed by using normalization schemes, which permitted the development of maps that, in turn, provided a way of capturing new clusterings of materials properties. The approach also provided a methodology for establishing common structure–property relationships across seemingly different classes of materials. While very valuable, it is limited in its predictive value and is ultimately based on utilizing prior models to build and seek relationships. The informatics strategy of studying materials behavior approaches the problem from a broader perspective. By exploring all types of data that may have varying degrees of influence on a given property (or properties), with no prior assumptions, one utilizes data-mining techniques to establish both classification and predictive assessments of materials behavior. This is not done from a purely statistical perspective, however, but from one in which a physics-driven approach to data collection is carefully integrated with data mining and then validated or analyzed with theory-based computation and/or experiments.

    Figure 1.8 An example of data integration in materials engineering, mapping correlations between mechanical properties over a wide range of materials: fracture toughness and modulus for metals, alloys, ceramics, glasses, polymers, and metallic glasses. The contours show the toughness Gc in kJ/m². (From Ashby and Greer, 2006.)
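    The contour variable in such a map can be sketched directly: assuming the usual plane-stress relation Gc ≈ K_Ic²/E, the code below computes the toughness contour value for a few materials. The property values are rough handbook-order-of-magnitude numbers chosen for illustration, not authoritative data:

```python
# Illustrative property values (order-of-magnitude only).
materials = {
    #                 K_Ic (MPa*m^0.5)   E (GPa)
    "steel":         (50.0,             210.0),
    "alumina":       (3.5,              370.0),
    "polycarbonate": (2.2,              2.4),
}

def toughness_Gc_kJ_per_m2(k_ic_mpa_sqrt_m, e_gpa):
    # K^2/E in these units: (MPa^2 * m) / GPa = 1e12 / 1e9 J/m^2 = kJ/m^2
    return k_ic_mpa_sqrt_m ** 2 / e_gpa

gc = {name: toughness_Gc_kJ_per_m2(k, e)
      for name, (k, e) in materials.items()}

# Materials on the same Gc contour share the same K^2/E even when K and
# E individually differ by orders of magnitude -- the normalization that
# lets very different material classes sit on one map.
print(sorted(gc, key=gc.get))  # lowest to highest Gc
```

    Ranking by the derived index rather than by either raw property is exactly the kind of normalization scheme the Ashby maps exploit.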

    The origins of the data can be either from experiment or computation and the former, when organized in terms of combinatorial experiments, can provide an opportunity to screen large amounts of data in a high-throughput fashion (see Figure 1.9).

    Figure 1.9 Two examples of assessing large arrays of data in a microarray format.

    (A) A correlation matrix showing effects of blood pressure control. (From Stoll et al., 2001.)

    (B) An experimental array of thin-film chemistries showing empirical correlations between optical behavior and chemistry. (From Liu and Schultz, 1999.)

    The materials informatics pathway to knowledge discovery, however, is not a linear one but rather an iterative process that can provide new information at each information cycle (Figure 1.10).

    Figure 1.10 A comparison of a hypothesis-driven strategy with a systems biology approach (A; adapted from Kitano, 2002) with a systems engineering approach for materials development (B; from Noor et al., 2000).

    As described by Kitano (2002), in terms of biological research, a cycle of research begins with the selection of contradictory issues of biological significance and the creation of a model representing the phenomenon. Models can be created either automatically or manually. The model represents a computable set of assumptions and hypotheses that need to be tested or supported experimentally. A similar analogy may be applied to materials science in trying to explain unexpected or unusual materials behavior such as the discovery, nearly two decades ago, of high-temperature superconductivity exhibited by oxide systems. Up to that point, the majority (but not all) of research in this field was focused on intermetallics. The new discovery at the time spawned a vast array of studies both experimental and theoretical to gain better understanding of the causes of this important materials behavior. This of course was part of a cycle of hypothesis-driven research in superconductivity that has had a long and distinguished history. The computational simulations (biologists refer to them as dry experiment) on models reveal computational adequacy of the assumptions and hypotheses embedded in each model. Inadequate models would expose inconsistencies with established experimental facts, and thus need to be rejected or modified. Models that pass this test become subjects of a thorough system analysis where a number of predictions may be made. A set of predictions that can distinguish a correct model among competing models is selected for experimental validation (called wet experiments by biologists). Successful experiments are those that eliminate inadequate models. Models that survive this cycle are deemed to be consistent with existing experimental evidence. 
While this is an idealized process of systems biology research, the hope is that advancement of research in computational science, analytical methods, technologies for measurements, and genomics/material informatics can transform research to fit this cycle for more systematic and hypothesis-driven science.
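    The reject-or-survive cycle described above can be caricatured in a few lines: candidate models are scored against "experimental" observations, and those inconsistent beyond a tolerance are eliminated. Models, data, and tolerance here are all invented for illustration:

```python
# Toy "wet experiment" data: (x, y) observations.
observations = [(1.0, 2.1), (2.0, 4.0), (3.0, 5.9), (4.0, 8.1)]

# Competing hypotheses, each a computable model ("dry experiments").
candidate_models = {
    "linear":    lambda x: 2.0 * x,
    "quadratic": lambda x: x ** 2,
    "constant":  lambda x: 3.0,
}

def max_residual(model):
    # Worst-case disagreement between model prediction and observation.
    return max(abs(model(x) - y) for x, y in observations)

tolerance = 0.5
surviving = sorted(name for name, model in candidate_models.items()
                   if max_residual(model) <= tolerance)
print(surviving)  # models consistent with existing evidence survive
```

    Models that survive one cycle would then be used to generate new, discriminating predictions for the next round of experiments, closing the loop sketched in Figure 1.10.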

    4 Data Mining: Data-Driven Materials Research

    Broadly speaking, data-mining techniques have two primary functions, pattern recognition and prediction, both of which form the foundations for understanding materials behavior. Based on the treatment by Tan et al. (2004), the former, which is more descriptive in scope, serves as a basis for deriving correlations, trends, clusters, trajectories, and anomalies among disparate data. The interpretation of these patterns is intrinsically tied to an understanding of materials physics and chemistry. In many ways this role of data mining is similar to the phenomenological structure–property paradigms that play a central role in the study of engineering materials, except that now we will be able to recognize these relationships with far greater speed and without necessarily depending on a priori models, provided of course we have the relevant data. The predictive aspect of data-mining tasks can serve both classification and regression operations. Data mining, which is an interdisciplinary blend of statistics, machine learning, artificial intelligence, and pattern recognition, is considered to have a few core tasks:

    • Cluster analysis: Seeks to find groups of closely related observations and is valuable in targeting groups of data that may have well-behaved correlations and can form the basis of physics-based as well as statistically-based models. Cluster analysis, when integrated with high-throughput experimentation, can serve as a powerful tool for rapidly screening combinatorial libraries.

    • Predictive modeling: Helps build models for targeted objectives (e.g. a specific materials property) as a function of input or exploratory variables. The success of these models also helps refine the usefulness and relevance of the input parameters.

    • Association analysis: Used to discover patterns that describe strongly associated features in data (e.g. the frequency of association of a specific materials property to materials chemistry). Such an analysis over extremely large data sets is made possible with the development of very-high-speed search algorithms and can help to develop heuristic rules for materials behavior governed by many factors.

    • Anomaly detection: Does the opposite by identifying data or observations significantly different from the norm. The ability to identify such anomalies or outliers is critical in materials since it can serve to identify a new class of materials with an unusual property (e.g. superconducting ceramics as opposed to insulating ceramics) or anticipate potentially harmful effects that are often identified through a retrospective analysis after an engineering failure (e.g. ductile–brittle transition).
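    Two of these core tasks can be sketched together on synthetic two-dimensional "property" data: a minimal k-means implementation finds the clusters, and points far from every cluster center are flagged as anomalies. The data, the planted outlier, and the distance threshold are all illustrative choices:

```python
import math
import random

# Two tight synthetic clusters plus one deliberately planted outlier
# (think: a material with an unusual property).
random.seed(1)
cluster_a = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(30)]
cluster_b = [(random.gauss(5, 0.3), random.gauss(5, 0.3)) for _ in range(30)]
outlier = (10.0, -4.0)
points = cluster_a + cluster_b + [outlier]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def kmeans(points, centers, iters=20):
    # Lloyd's algorithm: assign each point to its nearest center, then
    # move each center to the mean of its assigned points.
    for _ in range(iters):
        groups = {i: [] for i in range(len(centers))}
        for p in points:
            groups[min(range(len(centers)),
                       key=lambda i: dist(p, centers[i]))].append(p)
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g
                   else centers[i]
                   for i, g in groups.items()]
    return centers

centers = kmeans(points, centers=[(1.0, 1.0), (4.0, 4.0)])

# Anomaly detection: flag points far from every cluster center.
anomalies = [p for p in points if min(dist(p, c) for c in centers) > 3.0]
print(anomalies)  # the planted outlier
```

    The same distance-to-nearest-center score that defines the clusters doubles as the anomaly criterion, which is why the two tasks are often run together when screening combinatorial libraries.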

    In most materials science studies, we identify a priori likely variables or parameters that affect a set of properties. This is usually based on theoretical considerations and/or heuristic analysis based on prior experience. It is, however, difficult to integrate information simultaneously from multivariate data and especially when phenomenological relationships cannot always be explained in advance.

    As suggested by Ideker and Lauffenburger (2003), relationships between different components of information can be extracted from the scaffold using high-level computational models, which identify the key components, interactions, and influences required for more detailed low-level models. Large-scale experimental measurements validate high-level models, whereas targeted experimental manipulations and measurements test low-level models. The ultimate goal of knowledge discovery is achieved through the systematic integration of data and of correlation analyses developed with data-mining tools, validated, most importantly, by fundamental theory- and experiment-based materials science. The sources of data can be varied and numerous, ranging from computer simulations and high-throughput combinatorial experimentation to large-scale databases of legacy information. The application of advanced data-mining tools permits the processing of very large sets of information in a robust yet rapid manner. The collective integration of statistical learning tools (the high-level models shown in Figure 1.11) with experimental and computational materials science allows for an informatics-driven strategy for materials design.

    Figure 1.11 Computational models of cellular processes span a wide range of levels of abstraction. At the highest level, statistical data-mining approaches correlate dependent and independent variables, elucidating model components and their potential interrelationships. At a somewhat lower level, Bayesian networks expand on these relationships by modeling conditional dependencies of child nodes on their parents in the network, whereas Boolean and fuzzy logic models dictate logical rules governing these dependencies. Finally, at a relatively detailed level, Markov chains allow probabilistic production, loss and interconversion among molecular species and states, and complex systems of differential equations explicitly model a wide range of materials science behavior ranging from electronic structure calculations (e.g. Schrödinger equation) to transport behavior (diffusion equations). (From Ideker and Lauffenburger, 2003.)

    Ultimately, the processing–structure–properties paradigm that forms the core of materials development is based on understanding multivariate correlations and their interpretation in terms of the fundamental physics, chemistry, and engineering of materials. The field of materials informatics can advance that paradigm in a significant manner. A few critical questions may be helpful to keep in mind in building the informatics infrastructure for materials science:

    • How can data mining/machine learning best be used to discover which attributes (or combinations of attributes) in a material govern specific properties? Using information from different databases, we can compare and search for associations and patterns that can lead to ways of relating information among these different data sets.

    • What are the most interesting patterns that can be extracted from the existing material science data? Such a pattern search process can potentially yield associations between seemingly disparate data sets, as well as establish possible correlations between parameters that are not easily studied experimentally in a coupled manner.

    • How can we use mined associations from large volumes of data to guide future experiments and simulations? How does one select, from a materials library, the compounds most likely to have the desired properties? Data-mining methods should be incorporated as part of design and testing methodologies to increase the efficiency of the materials application process. For instance, a possible test bed for materials discovery could involve the use of massive databases on crystal structure, electronic structure, and thermochemistry. Each of these databases by itself can provide information on hundreds of binary, ternary, and multicomponent systems. Coupled to electronic structure and thermochemical calculations, one can enlarge this library to permit a wide array of simulations for thousands of combinations of materials chemistries. Such a massively parallel approach to generating new virtual data would be a daunting if not impossible task were it not for data-mining tools such as those proposed here.
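    One hedged sketch of such library screening, with every composition, descriptor, and property value invented for illustration: a nearest-neighbor model predicts a property for unmeasured candidates from the closest measured point in a simple descriptor space, and the library is ranked against a target:

```python
# Hypothetical measured data: descriptor -> property value (say, a band
# gap in eV). The descriptor here is an invented pair
# (mean atomic number, electronegativity difference).
measured = {
    (12.0, 1.2): 4.1,
    (30.0, 0.4): 1.1,
    (50.0, 2.0): 5.8,
    (20.0, 0.9): 2.9,
}

def predict(descriptor):
    # 1-nearest-neighbor regression: inherit the value of the closest
    # measured point in descriptor space.
    nearest = min(measured,
                  key=lambda d: sum((a - b) ** 2
                                    for a, b in zip(d, descriptor)))
    return measured[nearest]

# Unmeasured candidates in the library (names and descriptors invented).
library = {"cand_A": (13.0, 1.1), "cand_B": (48.0, 1.9), "cand_C": (29.0, 0.5)}

target = 5.0  # screen for candidates predicted closest to a target value
ranked = sorted(library, key=lambda c: abs(predict(library[c]) - target))
print(ranked)  # most promising candidate first
```

    In practice the "measured" set would come from the crystal-structure, electronic-structure, and thermochemical databases mentioned above, and the model would be far richer, but the rank-then-test loop is the same.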

    In this, the first chapter of the book, we have essentially provided a brief summary of some of the topics to be covered in the following chapters. We begin by providing an overview of some of the data-mining tools and vocabulary of the field of informatics. The following chapters examine how informatics or information processing of discrete data is being or can be applied to the field of materials science.

    References

    1. Aitchison JD, Galitski T. Inventories to insights. J Cell Biol. 2003;161(3):465–469.

    2. Ashby MF. Materials Selection in Mechanical Design Burlington, MA: Elsevier; 2011.

    3. Ashby MF, Greer AL. Metallic glasses as structural materials. Scr Mater. 2006;54:321–326.

    4. Butcher EC, Berg EL, Kunkel EJ. Systems biology in drug discovery. Nat Biotechnol. 2004; http://dx.doi.org/10.1038/nbt1017.

    5. Csete ME, Doyle JC. Reverse engineering of biological complexity. Science. 2002;295:1664–1669.

    6. Ideker T, Lauffenburger D. Building with a scaffold: Emerging strategies for high- to low-level cellular modeling. Trends Biotechnol. 2003;21:255–262.

    7. Ideker T, et al. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics. 2002;18:S233–S240.

    8. King RD, Garrett SM, Coghill GM. On the use of qualitative reasoning to simulate and identify metabolic pathways. Bioinformatics. 2005;21:2017–2026.

    9. Kitano H. Systems biology: a brief overview. Science. 2002;295:1662–1664.

    10. Liu DR, Schultz PG. Generating new molecular function: a lesson from nature. Angew Chem Int Ed. 1999;38:36–54.

    11. Noor AK, Venneri SL, Paul DB, Hopkins MA. Structures technology for future aerospace systems. Comput Struct. 2000;74:507–519.

    12. Rajan K. Materials informatics. Mater Today. 2005;8:35–39.

    13. Stoll M, Cowley Jr AW, Tonellato PJ, et al. A genomic-systems biology map for cardiovascular function. Science. 2001;294:1723–1726.

    14. Strausberg RL, Schreiber SL. From knowing to controlling: a path from genomics to drugs using small molecule probes. Science. 2003;300:294–295.

    15. Tan P-N, Steinbach M, Kumar V. Introduction to Data Mining. Addison-Wesley; 2004.

    16. Wirth BD, Caturla MJ, Diaz de la Rubia T, Khraishi T, Zbib H. Mechanical property degradation in irradiated materials: a multiscale modeling approach. Nucl Instrum Methods Phys Res B. 2001;180:23–31.

    Chapter 2

    Data Mining in Materials Science and Engineering

    Chandrika Kamath and Ya Ju Fan,    Center for Applied Scientific Computing, Lawrence Livermore National Laboratory

    1 Introduction

    Data mining techniques are increasingly being applied to data from scientific simulations, experiments, and observations with the aim of finding useful information in these data. These data sets are often massive and can be quite complex, taking the form of structured or unstructured mesh data from simulations or sequences of images from experiments. They are often multivariate, such as data from different sensors monitoring a process or an experiment. The data can be at different spatial and temporal scales, for example when a material is modeled at different scales to understand its behavior. In the case of experiments and observations, the data often have missing values and may be of low quality, for example images with low contrast or noisy sensor readings. Beyond the size and complexity of the data, further challenges arise when the data are analyzed for scientific discovery or for decision making. In the former case, the domain scientists may not have a well-formulated question they want addressed, or they may want the data explored to determine whether any insights can be gained. In the latter case, in addition to the results of the analysis, scientists may need to know how far they can trust those results, since decisions will be based on them.

    In this chapter, we provide a brief introduction to the field of data mining, focusing on techniques that are useful in the analysis of data from materials science and engineering applications. Data mining is the process of uncovering patterns, associations, anomalies, and statistically significant structures and events in data. It borrows and builds on ideas from a diverse collection of fields, including statistics, machine learning, pattern recognition, mathematical optimization, and signal, image, and video analysis. The literature on data analysis techniques is therefore enormous, and solution approaches are often rediscovered in different fields, where they may be known by different names. In addition, scientists often create their own solutions when they can exploit properties of the data to reduce the time for analysis or improve its accuracy.

    In light of this, we view this chapter as a starting point for someone interested in learning about the field of data mining and understanding the different categories of techniques available to address their data analysis problems in materials science. We begin the chapter by discussing the types of analysis problems often encountered in scientific domains, followed by a brief description of the analysis process. The bulk of the chapter focuses on different categories of analysis algorithms, including image analysis, dimension reduction, and the building of descriptive and predictive models. We conclude with some suggestions for further reading.

    First, some caveats. This chapter is written from the viewpoint of a data miner, not a materials scientist. The focus therefore is on algorithms rather than the insights into materials science one might gain from the application of these algorithms. Further, for the reasons mentioned earlier, the chapter barely scratches the surface of a broad, multidisciplinary field. Therefore, we recommend that the interested reader learn more about various techniques available to address their problem before selecting one for use with their data.

    2 Analysis Needs of Science Applications

    There are many ways in which scientific data-mining techniques are being used to analyze data from scientific simulations, experiments, and observations in a variety of domains ranging from astronomy to combustion, plasma physics, wind energy, and materials science. The tasks being addressed in these domains are often very similar, and data analysis problems in materials science can frequently be addressed using approaches developed in the context of a related problem in a different application domain.

    In the case of experimental data, the data are often in the form of one-dimensional signals or two-dimensional images. The data may be multivariate, for example the signals from several sensors monitoring a process, or may have a time component, such as a sequence of images taken over time. Data in the form of images are often analyzed to extract objects of interest and their characteristics, such as galaxies in astronomical surveys (Kamath et al., 2002) or fragments in images of material fragmentation (Kamath and Hurricane, 2011). Streaming data from sensors monitoring a process may be analyzed to determine if the process is evolving as expected; if something untoward is about to happen, prompting a shutdown; or if the process is moving from one normal state to another, requiring a change in the control parameters. Data analysis techniques are being used in manufacturing (Harding et al., 2006), in materials development (Morgan and Ceder, 2005), and in the intelligent processing of materials (Wadley and Vancheeswaran, 1998), which integrates advanced sensors, process models, and feedback control concepts.
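    As an illustration of the streaming case, the check for untoward behavior can be as simple as flagging a reading that deviates sharply from recent history. The sketch below, with hypothetical sensor values, flags any reading that lies more than three standard deviations from the mean of a trailing window; real monitoring systems are of course more sophisticated.

```python
from collections import deque
from statistics import mean, stdev

def monitor(stream, window=20, threshold=3.0):
    """Yield (index, value, is_anomaly) for each reading in the stream.

    A reading is flagged when it lies more than `threshold` standard
    deviations from the mean of the trailing window of prior readings.
    """
    history = deque(maxlen=window)
    for i, x in enumerate(stream):
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            is_anomaly = sigma > 0 and abs(x - mu) > threshold * sigma
        else:
            is_anomaly = False   # not enough history yet
        yield i, x, is_anomaly
        history.append(x)

# A steady hypothetical signal with one sudden excursion at index 7:
readings = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 5.0, 1.0, 1.1]
flags = [i for i, x, bad in monitor(readings, window=5) if bad]
```

In a real setting, a flagged reading would prompt a shutdown or a change in the control parameters, as described above.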

    In the case of simulation data, the output of the simulations can be analyzed to extract information on the phenomenon being modeled. A common task is to identify coherent structures in the data and extract statistics on these structures as they evolve over time. Other tasks include: sensitivity analysis (Saltelli et al., 2009) to understand how sensitive the outputs are to changes in the inputs of the simulation; uncertainty quantification (Committee on Mathematical Foundations of Verification, 2012) to understand how uncertainty in the inputs affects the outputs; and design and analysis of computer experiments (Fang et al., 2005), which uses the simulations to better understand the input space of the phenomenon. For example, if we are interested in creating materials with certain properties, we could use computer simulations to generate the properties for a sample of compounds. We could then analyze the inputs and outputs of these simulations to determine which inputs are more sensitive (and therefore must be sampled more finely), to build a data-driven model to predict the output given the inputs, and to place additional sample points at appropriate input values to create a more accurate predictive model.
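    A minimal sketch of this workflow, using a linear least-squares surrogate on synthetic input-output samples (both the model form and the data are illustrative assumptions, not from the text): the coefficient magnitudes give a crude sensitivity ranking, and the fitted surrogate predicts the output for unsimulated inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 50 simulation runs, each with two input parameters.
# The output depends strongly on the first input and weakly on the second.
X = rng.uniform(0.0, 1.0, size=(50, 2))
y = 5.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0.0, 0.01, size=50)

# Fit a linear surrogate y ~ X @ w + b by least squares.
A = np.column_stack([X, np.ones(len(X))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

# Coefficient magnitudes as a crude sensitivity measure: the first input
# dominates, so it is the one to sample more finely.
sensitivity = np.abs(coeffs[:2])

# The surrogate can then predict the output at new, unsimulated inputs.
x_new = np.array([0.5, 0.5, 1.0])   # two inputs plus the intercept term
y_pred = x_new @ coeffs
```

For nonlinear responses one would substitute a more flexible data-driven model (for example a Gaussian process or a random forest), but the sample-fit-predict-refine loop is the same.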

    In some problems, we may need to analyze both simulation and experimental data. This is often the case in validation, where we compare how close a simulation is to an experiment by extracting statistics from both (Kamath and Miller, 2007). Or we may want to use the simulations to guide an experiment or understand the results of an experiment better.
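    One way to sketch such a comparison is to measure the gap between the empirical distributions of a statistic extracted from the simulation and from the experiment. Below, a two-sample Kolmogorov-Smirnov statistic on hypothetical values (the data and the choice of statistic are assumptions for illustration):

```python
import bisect

def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical cumulative distribution functions of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample with values <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))

# Hypothetical values of some statistic from simulation and experiment:
simulated = [1.0, 1.2, 1.1, 0.9, 1.05]
measured = [1.0, 1.15, 1.1, 0.95, 1.0]
gap = ks_distance(simulated, measured)   # 0 = identical ECDFs, 1 = disjoint
```

A small gap suggests the simulation reproduces the experimental distribution of that statistic; whether the gap is small enough is a judgment left to the domain scientist.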

    Though there is a wide variety of analysis tasks encountered in scientific data sets, the techniques used in the analysis are often very similar. For example, methods used to extract information from images may be very similar regardless of whether the images were obtained using a hand-held camera or a scanning electron microscope. Also, techniques to identify coherent structures in simulation output may be similar for structured and unstructured grids. We next discuss some of the analysis techniques relevant to problems in materials science and engineering. A detailed discussion of all techniques is beyond the scope of this chapter; instead, brief descriptions are provided, followed by suggestions for further reading.

    3 The Scientific Data-Mining Process

    The process of scientific data mining is usually a multistep process, with the techniques used in each step motivated by the type of data and the type of analysis being performed (Kamath, 2009). At a high level, we can consider the process to consist of five steps, as shown in Figure 2.1. The first step is to identify and extract the objects of interest in the data. In some problems this is relatively easy, for example when the objects are chemical compounds or proteins and we are provided data on each of the compounds or proteins. In other cases, it can be more complicated, for example when the data are in the form of images and we need to identify the objects (say, galaxies in astronomical images) and extract them from the background. Once the objects have been identified, we need to describe them using features or descriptors. These should reflect the analysis task. For example, if the task focuses on the structure or the shape of the objects, the descriptors must reflect the structure or the shape respectively. In many cases, one may extract far more features than is necessary, requiring a reduction in the number of features or the dimension of the problem. These key features are then used in the pattern recognition step and the patterns extracted are visualized for validation by the domain scientists.

    Figure 2.1 Scientific data analysis: an iterative and interactive process.
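    The dimension reduction step of this process can be sketched with principal component analysis computed from the singular value decomposition. The feature matrix below is synthetic (six correlated descriptors driven by two latent factors), purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic objects-by-features matrix: 100 objects, 6 descriptors that
# are driven by only 2 latent factors plus a little noise.
latent = rng.normal(size=(100, 2))
features = latent @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(100, 6))

# Principal component analysis via the SVD of the centered data.
centered = features - features.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()   # variance fraction per component

# Keep just enough components to capture 99% of the variance.
k = int(np.searchsorted(np.cumsum(explained), 0.99)) + 1
reduced = centered @ Vt[:k].T           # objects in the reduced space
```

The reduced representation then feeds the pattern recognition step, where far fewer dimensions make both model building and visualization more tractable.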

    The data analysis process is iterative and interactive; any step may lead to a refinement of one or more of the previous steps and not all steps may be used during the analysis of a particular data set. For example, in some problems, such as the analysis of coherent structures (Kamath et al., 2009) or images of material fragmentation (Kamath and Hurricane, 2011), the analysis task may be to extract the objects and statistics on them. This would require only the first two steps of the process. In other problems, we may be given the data set in the form of objects described by a set of features and the task may be to identify the important features using dimension reduction techniques. A few problems may require the complete end-to-end process, for example finding galaxies with a particular shape in astronomical images.

    In our experience, data analysis is a close collaboration between the analysis expert and the domain scientist who is actively involved in all steps, starting from the initial description of the data and the problem, the extraction of potentially relevant features, the identification of the training set where necessary in pattern recognition, and the validation of the results from each step.

    We next discuss various analysis techniques we have found to be broadly useful in analysis of data from scientific simulations, experiments, and observations.

    4 Image Analysis

    In this section, we use some of our prior work to describe the tasks in image analysis. We consider the analysis of images obtained from experiments investigating the fragmentation of materials (Kamath and Hurricane, 2007, 2011). Figure 2.2A shows a subset of a larger image of a material as it fragments. The lighter regions are the fragments of the material, while the darker areas are the gaps between the fragments. The images were analyzed to obtain statistics for both the fragments (such as their size) and the gaps (such as their length and width). The distributions of these characteristics, in the form of histograms, were then used to provide a concise summary of each image.

    Figure 2.2 Processing of an image resulting from the fragmentation of a material. (A) Original image. (B) After the application of the Retinex algorithm to make the illumination uniform. (C) After smoothing to reduce the noise. (D) After segmentation to isolate the fragments from the background. (E) A zoomed-in view after cleanup following segmentation. (F) Identifying the skeleton of the gap regions.
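    Once an image has been segmented into fragment and gap pixels, the fragment-size statistics can be obtained by labeling connected components and counting pixels per label. A minimal pure-Python sketch on a toy binary image (not the experimental data; production code would use an image processing library):

```python
from collections import Counter

def label_components(binary):
    """Label the 4-connected components of a binary image (a list of lists
    of 0/1) by flood fill; return the label grid and the component count."""
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    current = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] and not labels[r][c]:
                current += 1               # start a new component
                stack = [(r, c)]
                labels[r][c] = current
                while stack:
                    i, j = stack.pop()
                    for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                        if (0 <= ni < rows and 0 <= nj < cols
                                and binary[ni][nj] and not labels[ni][nj]):
                            labels[ni][nj] = current
                            stack.append((ni, nj))
    return labels, current

# A toy segmented image: 1 = fragment pixel, 0 = gap.
image = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
]
labels, n = label_components(image)
sizes = Counter(v for row in labels for v in row if v)   # pixels per fragment
```

The resulting per-fragment sizes can then be binned into the histograms that summarize each image, as described above.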

    The approach used in the analysis was to first segment the fragments from the background (that is, the gaps) and then extract the statistics on both the fragments and the gaps. This is challenging for several reasons. First, there is a large variation in the intensity, with the top left corner being brighter than the lower right region. The intensity of fragments in the darker regions is similar to the intensity of the gaps in the brighter regions. Some fragments, especially those in the lower right corner, have a range of intensity values. The images are also quite grainy and there is no clear demarcation between the fragments and the
