ANALYTICS AND
TECH MINING FOR
ENGINEERING
MANAGERS
SCOTT W. CUNNINGHAM
AND JAN H. KWAKKEL
Abstract
This book offers practical tools in Python to students of innovation as
well as competitive intelligence professionals to track new developments in science, technology, and innovation. The book will appeal to
both tech-mining and data science audiences. For tech-mining audiences, Python presents an appealing, all-in-one language for managing
the tech-mining process. The book is a complement to other introductory
books on the Python language, providing recipes with which a practitioner
can grow a practice of mining text. For data science audiences, this book
gives a succinct overview of the most useful techniques of text mining.
The book also provides relevant domain knowledge from engineering
management, so that an appropriate context for analysis can be created.
This is the first book of a two-book series. This first book discusses
the mining of text, while the second one describes the analysis of text.
This book describes how to extract actionable intelligence from a variety
of sources including scientific articles, patents, pdfs, and web pages. There
are a variety of tools available within Python for mining text. In particular,
we discuss the use of pandas, BeautifulSoup, and pdfminer.
KEYWORDS
data science, natural language processing, patent analysis, Python, science,
technology and innovation, tech mining
Contents

List of Figures
List of Tables
Preface
Acknowledgments
2. Python Installation
2.5 Packages
Conclusions
References
Index
List of Figures

Figures 2.1–2.12, 3.1–3.7, 4.1–4.7, 5.1–5.2, 6.1, 8.1–8.9, and C.1
List of Tables

Tables 1.1–1.3, 4.1–4.3, and 7.1–7.3
List of Examples and Outputs

Example 5.3. Parsing row-structured data
Example 5.4. Adapting the parser for a new database
Example 5.5. Reading from a directory
Output 5.2. Example output of reading from a directory
Example 5.6. Loading and pretty-printing a JSON file
Output 5.3. Sample dictionary of dictionaries
Example 5.7. Extracting a sample from the dictionary of dictionaries
Output 5.4. Displaying part of a sample record
Example 6.1. XML to dictionary
Output 6.1. Patent stored in a dictionary
Example 6.2. Pretty-printing a dictionary
Output 6.2. Sample pretty-printed output
Example 6.3. Recursively printing a dictionary and its contents
Output 6.3. Top of the patent
Output 6.4. Cited literature in the patent
Output 6.5. Description of the invention
Example 6.4. BeautifulSoup example
Output 6.6. Scraped HTML sample
Example 6.5. Extracting readable text from HTML
Output 6.7. Example readable text from HTML
Example 6.6. Example use of PDFMiner
Output 6.8. Sample PDF output to text
Example 6.7. Get outlines method
Example 7.1. Splitting a corpus
Output 7.1. Results from splitting
Example 7.2. Making a counter
Output 7.2. Screen output from a counter
Example 7.3. The most common method
Output 7.3. The top 10 years
Example 7.4. Counting authors
Output 7.4. Top 10 authors
Example 7.5. Counting nations
Output 7.5. Top 10 nations
Example 7.6. Extracting a dictionary
Example 7.7. Loading a JSON file
Example 7.8. Fetching a field
Output 7.6. Sample counter
Example 7.9. Counting a field
Output 7.7. The most frequent words
Example 8.1. Selective indexing
Examples 8.2–8.15 and Outputs 8.1–8.4
Preface
The authors of this book asked me to share perspectives on tech mining.
I co-authored the 2004 book on the topic (Porter and Cunningham 2004).
With an eye toward Scott and Jan's materials, here are some thoughts.
These are meant to stimulate your thinking about tech mining and you.
Who does tech mining? Experience suggests two contrasting types
of people: technology and data folks. Technology folks know the subject;
they are either experienced professionals or trained professionals or both,
working in that industry or research field to expand their intelligence via
tech mining. They seek to learn a bit about data manipulation and analytics to accomplish those aims. For instance, imagine a chemist seeking a
perspective on scientific opportunities or an electrical engineer analyzing
emerging technologies to facilitate open innovation by his or her company.
The data science folks are those whose primary skills include some variation of data science and analytics. I, personally, have usually been in this
group, needing to learn enough about the subject under study to not be
totally unacquainted. Moreover, in collaborating on a major intelligence
agency project to identify emerging technologies from full-text analyses,
we were taken by the brilliance of the data folks: really impressive
capabilities to mine science, technology, and innovation text resources.
Unfortunately, we were also struck by their difficulties in relating those
analyses to real applications. They were unable to practically identify
emergence in order to provide usable intelligence.
So, challenges arise on both sides. But, a special warning to readers of this book: we suspect you are likely Type B (the data science folks), and we fear that the
challenges are tougher for us. Years ago, we would have said the opposite: analysts can analyze anything. Now, we think the other way; that
you really need to concentrate on relating your skills to answering real
questions in real time. My advice would be to push yourself to perform
hands-on analyses on actual tech-mining challenges. Seek out internships
or capstone projects or whatever, to orient your tech mining skills to generate answers to real questions, and to get feedback to check their utility.
Having said that, an obvious course of action is to team up Types A
and B to collaborate on tech-mining work. This is very attractive, but you
must work to communicate well. Don't invest 90 percent of your energy in
that brilliant analysis and 10 percent in telling about it. Think more toward
a 50:50 process where you iteratively present preliminary results, and get
feedback on the same. Adjust your presentation content and mode to meet
your users' needs, not just your notions of what's cool.
What's happening in tech mining? The field is advancing. It's hard
for a biased insider like me to gauge this well, but check out the website www.VPInstitute.org. Collect some hundreds of tech-mining-oriented
papers and survey their content. You can quickly get a picture of the
diverse array of science, technology, and innovation topics addressed in
the open-source literature. Less visible, but the major use of tech-mining
tools, are the competitive technical intelligence applications by companies and agencies.
Tech mining is advancing. In the 2000s, studies largely addressed
who, what, where, and when questions about an emerging technology.
While research profiling is still useful, we now look to go further along
the following directions:
- Assessing R&D in a domain of interest, to inform portfolio management or funding agency program review.
- Generating competitive technological intelligence, to track known competitors and to identify potential friends and foes. Tech mining is a key tool to facilitate open innovation by identifying potential sources of complementary capabilities and collaborators.
- Technology road mapping by processing text resources (e.g., sets of publication or patent records on a topic under scrutiny) to extract topical content and track its evolution over time periods.
- Contributing to future-oriented technology analyses: tech mining provides vital empirical grounding to inform future prospects. Transition from identifying past trends and patterns of engagement to laying out future possibilities is not automatic, and offers a field for productive study.
I'd point to some resources to track what's happening in tech mining as time progresses.
Note the globalization of tech-mining interest. For instance, this
book has been translated into Chinese (Porter and Cunningham

well with Pajek (open source) to generate science and patent overlay maps
to show disciplinary participation in R&D areas under study (cf. Kay
et al. 2014).
Alan Porter
Atlanta, Georgia
July 30, 2015
Acknowledgments
SWC: This work was partially funded by a European Commission grant,
grant number 619551.
JHK: This work was partially funded by the Dutch National Science
Foundation, grant number 451-13-018.
CHAPTER 1
Nowadays, open source software spans the space from the operating
system (e.g., the Linux kernel) all the way to very specialized applications
like GIMP (a Photoshop alternative). Moreover, the idea of open source
has spread to other domains as well. For example, in electronics, Arduino
is based on open source principles, and a website like Wikipedia also
builds on open source ideals.
There are many programming languages available. Why are we
using Python in this book? There are several reasons why we have chosen
Python. First, Python is open source software. The licenses under which
most Python libraries are distributed are quite liberal. Therefore,
they can be distributed freely, even in the case of commercial applications. It
is also free, and can easily be acquired via the Internet. Python is platform
independent. It runs under Windows, Mac OS X, and virtually all Linux
distributions. Moreover, with a little care, programmers can write code
that will run without change on any of these operating systems.
Second, Python is a general purpose programming language. This
means that Python is designed to be used for developing applications in
a wide range of application domains. Other domain-specific languages
are much more specialized. For example, Matlab, frequently used in
engineering, is designed for matrix operations. Being a general purpose
programming language, Python can be used for, among other things,
string handling, mathematics, file input and output, connecting to databases, connecting over a network and to websites, and GUI development.
Python comes with a comprehensive standard library. The library contains
modules for graphical user interface development, for connecting to the
Internet as well as various standard relational databases, for handling regular expressions and for software testing. Next to the extensive standard
library, there are many libraries under active development for scientific
computing applications. This scientific computing community is vibrant
and actively developing bothcornerstone libraries for general scientific computing and domain-specific libraries. This implies that Python
is increasingly being seen as a viable open source alternative to many
established proprietary tools that have typically been used in science and
engineering applications.
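Much of this breadth is available without installing anything extra. As a minimal sketch, the standard library alone can parse a record and search its text; the record below is a hypothetical stand-in for parsed article metadata.

```python
# Standard-library sketch: json parses a record, re extracts a year.
import json
import re

# A hypothetical, simplified article record.
record = json.loads('{"title": "Nanotechnology review (2014)", "authors": ["Lee", "Chen"]}')

# Find a four-digit year (1900s or 2000s) anywhere in the title.
match = re.search(r"\b(19|20)\d{2}\b", record["title"])
year = match.group(0) if match else None
print(year)  # prints "2014"
```

No third-party packages are needed here; json, re, and the rest of the standard library ship with every Python distribution.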
Third, the language design of Python focuses on readability and
coherence. Quite often, it is hard to read code, even if you wrote
it yourself a few weeks ago. In contrast, the language design of Python
strives for code that is easy to read. Both collaboration and education benefit from this readability. One of the most obvious
ways in which Python enforces readability is through its use of indentation
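A short sketch shows how indentation carries the structure; the function and its data are hypothetical examples, not code from the book.

```python
# Indentation is syntax in Python: each block's body is marked by its
# indentation level, which makes the structure visible at a glance.
def count_keywords(titles, keyword):
    """Count how many titles mention a keyword (case-insensitive)."""
    total = 0
    for title in titles:                       # loop body: indented once
        if keyword.lower() in title.lower():   # branch body: indented twice
            total += 1
    return total

print(count_keywords(["Graphene sensors", "Solar cells", "Graphene inks"], "graphene"))
# prints 2
```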
suitable for further analysis. For instance, you can create a set of articles
fully indexed by content. Then it becomes possible to filter and retrieve
your content. This often reveals surprising relationships in the data. You
can also compile organizational collaborations across the articles. Like
article indices, these collaboration tables are often inputs to data analysis
or machine-learning routines.
The final form of information product, which we'll consider here,
is the cross-tab. Cross-tabs mix two or more of the journalist's questions
to provide more complex insight into a question of research and development management. For instance, a cross-tab that shows which organizations specialize in which content can be used for organizational profiling.
A decision-maker may use this as an input into questions of strategic
alliance. The variety of information products that we will be considering
in the book is shown in Table 1.2.
Lists include quick top 10 summaries of the data. For instance, a list
might be of top 10 most published authors in a given domain. These lists
should not be confused with the Python data structure known as a list.
We'll be discussing this data structure in subsequent chapters. A table is a
complete record of information, indexed by a unique article id or a unique
patent id. Such a table might include the presence or absence of key terms
in an article. Another example of a table could include all the collaborating organizations, unique to each article. A cross-tab merges two tables
together to produce an informative by-product. For instance, we could
combine the content and organization tables to produce an organizational
profile indicating which organizations research what topics. Our usage of
list, table, and cross-tab is deliberately informal here.
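An organizational profile of the kind just described can be sketched with the pandas crosstab function; the organization and topic values below are hypothetical stand-ins for fields parsed from article records.

```python
# A minimal cross-tab sketch with pandas (hypothetical data).
import pandas as pd

records = pd.DataFrame({
    "organization": ["TU Delft", "TU Delft", "Georgia Tech", "Georgia Tech", "TU Delft"],
    "topic":        ["modeling", "text mining", "text mining", "text mining", "modeling"],
})

# Rows: organizations; columns: topics; cells: article counts.
profile = pd.crosstab(records["organization"], records["topic"])
print(profile)
```

Each cell counts how often an organization published on a topic, which is exactly the kind of input a strategic-alliance question might start from.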
The table shows the type of question being asked, as well as the form
of information product, resulting in a five-by-three table of possibilities
(Table 1.3). Although we haven't created examples of all the 15 kinds of
questions and products represented in this table, there is a representative
sample of many of these in the book to follow.
We now briefly introduce the book to follow. The next chapters,
Chapters 2 and 3, provide a quick start to the Python programming language. While there are many fine introductory texts and materials on
Table 1.2. Types of information product

Type of information product:
  List
  Table
  Cross-tab

Table 1.3. Questions by information products
  Rows: Who, What, When, Where, Why
  Columns: List, Table, Cross-tab
Python, we offer a quick start to Python in these two chapters. The chapters provide one standard way of setting up a text mining system, which
can get you started if you are new to Python. The chapters also provide
details on some of the most important features of the language, to get you
started, and to introduce some of the more detailed scripts in the book to
follow.
There is also a chapter on data understanding, which is Chapter 4 of
the book. This chapter covers sources of science, technology, and innovation information. There is a wealth of differently formatted files, but they
basically break down into row-, column-, and tree-structured data. During
the data mining process, cleaning and structuring the data is incredibly
important.
We provide two full chapters on the topic, Chapters 5 and 6, where
we guide you through processes of extracting data from a range of text
sources. Here, the differences between text mining and more
general data mining processes become especially apparent. These chapters introduce
the idea that text is structured in three major ways: by rows, by columns,
and by trees. The tree format in particular leads us to consider a range of
alternative media formats including the pdf format and the web page.
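To give a feel for tree-structured data, here is a minimal sketch using the standard library's ElementTree; the XML is a hypothetical, simplified patent record, not a real patent schema.

```python
# Walking a tree-structured record with xml.etree.ElementTree.
import xml.etree.ElementTree as ET

xml = """
<patent>
  <title>Flexible display substrate</title>
  <inventors>
    <inventor>A. Kim</inventor>
    <inventor>B. Osei</inventor>
  </inventors>
</patent>
"""

root = ET.fromstring(xml)
print(root.find("title").text)                    # prints "Flexible display substrate"
print([el.text for el in root.iter("inventor")])  # prints "['A. Kim', 'B. Osei']"
```

The same find-and-iterate pattern carries over to the nested structures produced when parsing patents, web pages, and PDF outlines.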
The book concludes with Chapters 7 and 8, where we discuss producing informative lists and tables for your text data. These chapters walk
through gradually increasing levels of complexity, ranging from simple top
10 lists on your data, to full tables, and then to informative cross-tabs of
the data. These outputs are useful both for decision-makers and for
additional statistical analysis.
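The simplest of these outputs, the top 10 list, can be sketched with the standard library's Counter; the author names below are hypothetical stand-ins for a field extracted from a corpus.

```python
# A "top N" list with collections.Counter (hypothetical data).
from collections import Counter

authors = ["Li", "Smith", "Li", "Garcia", "Li", "Smith", "Chen"]
counts = Counter(authors)

# most_common(n) returns the n most frequent items as (name, count) pairs.
print(counts.most_common(2))  # prints "[('Li', 3), ('Smith', 2)]"
```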
Index
A
altmetrics, 42
Anaconda, 12
ARPANET, 1
arrays, 18
B
BeautifulSoup library, 67
Web scraping, 67–70
blogs, 42
Boolean query, 40
Bouzas, V., 30
C
Canopy, 12
citation
forward, 82
measures, 43
cited record field (CR), 39
column-formatted data, 35
reading, 47–51
compound data structures, 30–32
Continuum Analytics page, 12
corpus, 32, 48, 52, 54, 83
counter, 29, 84–85
counter, making, 78–79
CRISP-DM. See cross-industry
standard process for data
mining
cross-industry standard process for
data mining (CRISP-DM), 106
cross-tabs, 6, 76
creating, 96–100
D
data
collecting and downloading,
34–41
formats, 35
mining, 5, 7
unstructured, 35
data directory, 21
data structures, 26–30, 105
compound, 30–32
counter, 29
data transformation, 105
databases, 38
DataFrames, 18, 91–96
reporting on, 100–102
Delicious, 34, 43
Derwent, 38
describe() method, 101
Designated states (DS), 40
development environment, 13–17
dictionaries, 27
corpus, 32
reading and parsing, 54–56
reading and printing JSON
dictionary of, 56–59
dictionary, defined, 48
distribution, Python, 12
E
Eclipse, 13. See also integrated
development environment
(IDE)
enumerate() method, 26
M
machine learning, 107
machine-accessible format, 35
Matplotlib, 18
MEDLINE database, 44
Mendeley, 34
module, 17
N
nanometer, 44
nanotechnology, 36, 44
National Library of Medicine, 40
Natural Language Tool Kit
(NLTK) package, 107
NetworkX, 18
NetworkX package, 107
NLTK. See Natural Language Tool
Kit
NumPy package, 18
O
open source software, 1
output directory, 21
P
package, 17–19
NetworkX, 18
NumPy, 18
Pandas, 18
Pip, 18
Scikit-learn, 18
SciPy, 18
Pandas package, 18
dataframes, 91, 95
parsers
pdf files, 71
PubMed, 5354
partial index, 9091
pdf files
mining content, 71–74
parsing, 71
pdfminer3k, 71
Pip package, 18
PMID, 52
technology, 33
text analytics, 106–108
text mining, 3
input organization, 21
outputs organization, 21
Python part of, 109–110
text visualization, 107
Text-Mining-Repository, 9
Thomson Data Analyzer, 110
to_csv() method, 102
tree-formatted data, 35
Twitter, 18, 34, 43
U
University of Science and
Technology of China, 99
unstructured data, 35
urllib module, 67–68
V
VantagePoint software, 110
W
walk method, 55
Web of Science, 35
web pages, 42
Web scraping, 67
BeautifulSoup library, 67–70
Wikipedia, 34, 43
wildcards, 35. See also Web of
Science
words, counting, 83–88
X
XML. See Extensible Mark-up
Language
Z
zip method, 50