Nutch in a Nutshell: Complete Web Search Engine

Nutch in a Nutshell
Presented by
Liew Guo Min
Zhao Jin
Outline
 Recap
 Special features
 Running Nutch in a distributed environment (with demo)
 Q&A
 Discussion
Recap
 Complete web search engine
 Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)
 Java based, open source
 Features:
 Customizable
 Extensible
 Distributed
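Nutch as a crawler
[Diagram: the Injector loads the initial URLs into the CrawlDB. The Generator reads the CrawlDB and generates a Segment. The Fetcher gets webpages/files from the Web and reads/writes the Segment. The Parser reads/writes the Segment. The CrawlDBTool updates the CrawlDB.]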
Special Features
 Extensible (Plugin system)
 Most of Nutch's essential functionality is implemented as plugins
 Three layers
 Extension points
 What can be extended: Protocol, Parser, ScoringFilter, etc.
 Extensions
 The interfaces to be implemented for the extension points
 Plugins
 The actual implementation
Special Features
 Anyone can write a plugin
 Write the code
 Prepare metadata files
 plugin.xml: what has been extended by what
 build.xml: how ant can build your source code
 Ask Nutch to include your plugin in conf/nutch-site.xml
 Tell ant to build your plugin in src/plugin/build.xml
 More details @ http://wiki.apache.org/nutch/PluginCentral
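As a rough illustration of the two metadata steps above, the fragments below sketch a plugin.xml and the matching nutch-site.xml entry. The plugin id parse-example, the jar name, and the class org.example.nutch.ExampleParseFilter are hypothetical placeholders, and the other plugin names in plugin.includes are only an illustrative selection; check the files shipped with your Nutch version before copying anything.

```xml
<!-- src/plugin/parse-example/plugin.xml (hypothetical plugin) -->
<plugin id="parse-example" name="Example Parse Filter"
        version="1.0.0" provider-name="org.example">
  <runtime>
    <!-- jar produced by the plugin's build.xml -->
    <library name="parse-example.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <!-- "what has been extended by what": extension point + implementing class -->
  <extension id="org.example.nutch.parse" name="Example Parse Filter"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="ExampleParseFilter"
                    class="org.example.nutch.ExampleParseFilter"/>
  </extension>
</plugin>
```

```xml
<!-- conf/nutch-site.xml: plugin.includes is a regular expression over plugin ids -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|parse-example</value>
</property>
```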
Special Features
 To use a plugin
 Make sure you have modified nutch-site.xml to include the plugin
 Then, either
 Nutch automatically calls it when needed, or
 You write code that loads it by its class name and then uses it
Special Features
 Distributed (Hadoop)
 Map-Reduce (Diagram)
 A framework for distributed programming
 Map -- Process the splits of data to get intermediate results, plus keys that indicate which results should be put together later
 Reduce -- Process the intermediate results that share the same key and output the final result
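The two phases can be sketched in a single Java process as follows. This is only an illustration of the idea, not the Hadoop API: the class MiniMapReduce, its run and demo methods, and the toy job (summing word lengths by first letter) are all made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Single-process illustration of Map and Reduce; hypothetical, not the Hadoop API. */
final class MiniMapReduce {

    static <I, K extends Comparable<K>, V, O> Map<K, O> run(
            List<I> splits,
            Function<I, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, O> reducer) {
        // Map: process every split on its own and emit <key, intermediate result> pairs.
        // The key says which intermediate results belong together later.
        Map<K, List<V>> grouped = new TreeMap<>();
        for (I split : splits) {
            for (Map.Entry<K, V> pair : mapper.apply(split)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        // Reduce: one call per key, over all intermediate results with that key.
        Map<K, O> output = new TreeMap<>();
        for (Map.Entry<K, List<V>> entry : grouped.entrySet()) {
            output.put(entry.getKey(), reducer.apply(entry.getKey(), entry.getValue()));
        }
        return output;
    }

    /** Toy job: total length of the words in all splits, keyed by first letter. */
    static Map<Character, Integer> demo() {
        Function<String, List<Map.Entry<Character, Integer>>> mapper = split -> {
            List<Map.Entry<Character, Integer>> pairs = new ArrayList<>();
            for (String word : split.split(" ")) {
                pairs.add(Map.entry(word.charAt(0), word.length()));
            }
            return pairs;
        };
        BiFunction<Character, List<Integer>, Integer> reducer =
                (letter, lengths) -> lengths.stream().mapToInt(Integer::intValue).sum();
        return run(List.of("apple avocado", "banana apricot"), mapper, reducer);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {a=19, b=6}
    }
}
```

In a real cluster the map calls run on different machines and the grouping step moves data over the network; the sequential loop here only shows the data flow.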

Special Features
 MapReduce in Nutch
 Example 1: Parsing
 Input: <url, content> files from fetch
 Map(url, content) -> <url, parse> by calling parser plugins
 Reduce is identity
 Example 2: Dumping a segment
 Input: <url, CrawlDatum>, <url, ParseText> etc. files from segment
 Map is identity
 Reduce(url, value*) -> <url, ConcatenatedValue> by simply concatenating the text representation of values
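Example 2 can be mimicked in plain Java as below. This is a sketch under stated assumptions, not Nutch's own segment-dumping code: the class SegmentDumpSketch is hypothetical, and plain strings stand in for the text representations of CrawlDatum and ParseText values (their contents here are invented).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical single-process sketch of "dumping a segment". */
final class SegmentDumpSketch {

    /** Map is identity: every <url, value> record passes through unchanged. */
    static Map.Entry<String, String> map(String url, String value) {
        return Map.entry(url, value);
    }

    /** Reduce: concatenate the text representations of all values seen for one URL. */
    static String reduce(String url, List<String> values) {
        return String.join("\n", values);
    }

    /** Runs map on every record, groups the results by URL, then reduces each group. */
    static Map<String, String> dump(List<Map.Entry<String, String>> records) {
        Map<String, List<String>> byUrl = new TreeMap<>();
        for (Map.Entry<String, String> record : records) {
            Map.Entry<String, String> mapped = map(record.getKey(), record.getValue());
            byUrl.computeIfAbsent(mapped.getKey(), k -> new ArrayList<>())
                 .add(mapped.getValue());
        }
        Map<String, String> dumped = new TreeMap<>();
        byUrl.forEach((url, values) -> dumped.put(url, reduce(url, values)));
        return dumped;
    }

    public static void main(String[] args) {
        // Placeholder strings; real records would come from the segment's files.
        List<Map.Entry<String, String>> records = List.of(
                Map.entry("http://example.org/", "CrawlDatum: (placeholder text)"),
                Map.entry("http://example.org/", "ParseText: (placeholder text)"));
        dump(records).forEach((url, text) -> System.out.println(url + "\n" + text));
    }
}
```

Example 1 is the mirror image: the work happens in map (calling the parser) and reduce simply passes each pair through.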
Special Features
 Distributed File system
 Write-once-read-many coherence model
 High throughput
 Master/slave
 Simple architecture
 Single point of failure
 Transparent
 Access via Java API
 More info @ http://lucene.apache.org/hadoop/hdfs_design.html
Running Nutch in a distributed environment
 MapReduce
 In hadoop-site.xml
 Specify job tracker host & port
 mapred.job.tracker
 Specify task numbers
 mapred.map.tasks
 mapred.reduce.tasks
 Specify location for temporary files
 mapred.local.dir
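A hadoop-site.xml fragment covering these MapReduce settings might look like the sketch below. The host name, port, task counts, and directory are assumed values chosen for illustration; only the property names come from the slide.

```xml
<configuration>
  <!-- job tracker host & port (assumed values) -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master.example.org:9001</value>
  </property>
  <!-- task numbers (assumed values; tune to the size of your cluster) -->
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>
  <!-- location for temporary files on each node (assumed path) -->
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>
  </property>
</configuration>
```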
Running Nutch in a distributed environment
 DFS
 In hadoop-site.xml
 Specify namenode host, port & directory
 fs.default.name
 dfs.name.dir
 Specify location for files on each datanode
 dfs.data.dir
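The DFS settings go into the same hadoop-site.xml. The fragment below is a sketch with assumed values: the host, port, and both directories are placeholders. The host:port form of fs.default.name shown here is, to the best of our knowledge, what Hadoop releases of this period expected; later releases expect a URI such as hdfs://host:port, so check the documentation for your version.

```xml
<configuration>
  <!-- namenode host & port (assumed values) -->
  <property>
    <name>fs.default.name</name>
    <value>master.example.org:9000</value>
  </property>
  <!-- directory where the namenode keeps its metadata (assumed path) -->
  <property>
    <name>dfs.name.dir</name>
    <value>/var/hadoop/dfs/name</value>
  </property>
  <!-- directory where each datanode stores its blocks (assumed path) -->
  <property>
    <name>dfs.data.dir</name>
    <value>/var/hadoop/dfs/data</value>
  </property>
</configuration>
```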
Demo time!
Q&A
Discussion
Exercises
 Hands-on exercises
 Install Nutch, crawl a few webpages using the crawl command, and perform a search on them using the GUI
 Repeat the crawling process without using the crawl command
 Modify your configuration to perform each of the following crawl jobs, and think about when they would be useful
 To crawl only webpages and PDFs but nothing else
 To crawl the files on your hard disk
 To crawl but not to parse
 (Challenging) Modify Nutch such that you can unpack the crawled files in the segments back into their original state
Reference
 http://wiki.apache.org/nutch/PluginCentral -- Information on Nutch plugins
 http://lucene.apache.org/hadoop/ -- Hadoop homepage
 http://wiki.apache.org/lucene-hadoop/ -- Hadoop Wiki
 http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/map -- "MapReduce in Nutch"
 http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"
 http://www.mail-archive.com/nutch-commits@lucene.apache.org/msg01951.html -- Updated tutorial on setting up Nutch, Hadoop and Lucene together
Excursion: MapReduce
 Problem
 Find the number of occurrences of “cat” in a
file
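 What if the file is 20 GB in size?
 Why not do it with more computers?
 Solution
[Diagram: the file is divided into Split 1 and Split 2. PC1 counts 200 occurrences in Split 1 and PC2 counts 300 in Split 2. PC1 then adds the two partial counts to get 500.]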
 Problem
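 Find the number of occurrences of both “cat” and “dog” in a very large file
 Solution
[Diagram with three stages: Map, Sort/Group, Reduce. Map (input files to intermediate files): PC1 processes Split 1 and emits cat:200 and dog:250; PC2 processes Split 2 and emits cat:300 and dog:250. Sort/Group: the partial counts are grouped by word, giving cat: 200, 300 and dog: 250, 250. Reduce (intermediate files to output files): PC1 outputs cat:500 and PC2 outputs dog:500.]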
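The three stages for the “cat” and “dog” problem can be walked through in one Java process, as in the sketch below. It is an illustration only: the class CatDogCount and its method names are made up, the two short strings stand in for splits of a very large file, and the map calls run one after another rather than on separate PCs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical single-process sketch of the "cat"/"dog" count. */
final class CatDogCount {
    private static final Set<String> TARGETS = Set.of("cat", "dog");

    /** Map: count the target words inside one split. */
    static Map<String, Integer> map(String split) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : split.toLowerCase().split("\\W+")) {
            if (TARGETS.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Sort/Group: collect the partial counts from all splits under their word. */
    static Map<String, List<Integer>> group(List<Map<String, Integer>> partials) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) ->
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(n));
        }
        return grouped;
    }

    /** Reduce: add up the partial counts for one word. */
    static int reduce(List<Integer> partialCounts) {
        int total = 0;
        for (int n : partialCounts) {
            total += n;
        }
        return total;
    }

    static Map<String, Integer> count(List<String> splits) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String split : splits) {
            partials.add(map(split)); // on a cluster, each call would run on its own PC
        }
        Map<String, Integer> totals = new TreeMap<>();
        group(partials).forEach((word, counts) -> totals.put(word, reduce(counts)));
        return totals;
    }

    public static void main(String[] args) {
        List<String> splits = List.of(
                "The cat saw a dog. The cat ran.",
                "A dog chased the other cat; dog won.");
        System.out.println(count(splits)); // prints {cat=3, dog=3}
    }
}
```

Because reduce only needs the partial counts for a single word, the words can be handed to different machines, which is how the diagram ends up with one PC producing the total for cat and another the total for dog.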

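 Generalized Framework
[Diagram with three stages: Map, Sort/Group, Reduce. A Master coordinates the Workers. Map (input files to intermediate files): Workers read Splits 1 to 4 and emit key:value pairs such as k1:v1, k3:v2, k2:v4, k2:v5 and k4:v6. Sort/Group: pairs with the same key are brought together, for example k2:v4,v5. Reduce (intermediate files to output files): Workers process each group and write Outputs 1 to 3.]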
