Presented by
Liew Guo Min
Zhao Jin
Outline
Recap
Special features
Running Nutch in a distributed environment
(with demo)
Q&A
Discussion
Recap
Complete web search engine
Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
+ Plugins
+ MapReduce & Distributed FS (Hadoop)
Features:
Customizable
Extensible
Distributed
Nutch as a crawler
[Diagram: the Nutch crawl cycle. Components: Initial URLs, Injector, CrawlDB, Segment, Parser, and the Web (webpages/files). Labeled arrows: generate, get, update, read/write.]
Special Features
Extensible (Plugin system)
Most of Nutch's essential functionality is
implemented as plugins
Three layers
Extension points
What can be extended: Protocol, Parser, ScoringFilter, etc.
Extensions
The interfaces to be implemented for the extension points
Plugins
The actual implementation
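The three layers above can be pictured with a minimal plain-Java sketch. It does not use Nutch's real plugin API: the names TextParser, TagStrippingParser, PluginRegistry and the extension point name "parser" are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: none of these types belong to Nutch.
class PluginLayersDemo {

    // The interface that an implementation of the (hypothetical) "parser"
    // extension point has to provide.
    interface TextParser {
        String parse(String rawContent);
    }

    // The plugin layer: an actual implementation of that interface.
    static class TagStrippingParser implements TextParser {
        @Override
        public String parse(String rawContent) {
            // Naive tag removal, good enough for a demonstration.
            return rawContent.replaceAll("<[^>]*>", "").trim();
        }
    }

    // A hypothetical registry that records what has been extended by what.
    static class PluginRegistry {
        private final Map<String, List<Object>> extensions = new HashMap<>();

        void register(String extensionPoint, Object implementation) {
            extensions.computeIfAbsent(extensionPoint, k -> new ArrayList<>())
                      .add(implementation);
        }

        <T> List<T> lookup(String extensionPoint, Class<T> type) {
            List<T> result = new ArrayList<>();
            List<Object> registered =
                extensions.getOrDefault(extensionPoint, Collections.emptyList());
            for (Object candidate : registered) {
                if (type.isInstance(candidate)) {
                    result.add(type.cast(candidate));
                }
            }
            return result;
        }
    }

    // Registers one implementation and runs it through the registry.
    static String run(String rawContent) {
        PluginRegistry registry = new PluginRegistry();
        registry.register("parser", new TagStrippingParser());
        TextParser parser = registry.lookup("parser", TextParser.class).get(0);
        return parser.parse(rawContent);
    }

    public static void main(String[] args) {
        System.out.println(run("<b>Nutch</b>")); // prints Nutch
    }
}
```

The caller only depends on the interface and the extension point name, so a different implementation can be registered without touching the calling code.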
Special Features
Extensible (Plugin system)
Anyone can write a plugin
Write the code
Prepare metadata files
plugin.xml: declares what has been extended by what
build.xml: tells Ant how to build your source code
wiki.apache.org/nutch/PluginCentral
Special Features
Extensible (Plugin system)
To use a plugin
Make sure you have modified nutch-site.xml to
include the plugin
Then, either
Nutch calls it automatically when needed, or
You write code that loads it by its class name and
then uses it
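The second option, loading an implementation by its class name, can be sketched with plain Java reflection. This is not Nutch's own plugin loader, only the general mechanism; UrlFilter, HttpOnlyFilter and load are hypothetical names, and the loaded class is assumed to have a no-argument constructor.

```java
// Illustration only: generic Java reflection, not Nutch's plugin loading code.
class LoadByClassNameDemo {

    // Hypothetical interface the loaded class is expected to implement.
    interface UrlFilter {
        boolean accept(String url);
    }

    // Hypothetical implementation that would normally live in a plugin.
    static class HttpOnlyFilter implements UrlFilter {
        @Override
        public boolean accept(String url) {
            return url.startsWith("http://") || url.startsWith("https://");
        }
    }

    // Looks the class up by name, instantiates it, and checks its type.
    static UrlFilter load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            Object instance = cls.getDeclaredConstructor().newInstance();
            return UrlFilter.class.cast(instance);
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalStateException("Cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        // In practice the name would come from configuration; here it is
        // taken from the class itself so the example stays self-contained.
        String className = HttpOnlyFilter.class.getName();
        UrlFilter filter = load(className);
        System.out.println(filter.accept("http://example.org/")); // prints true
        System.out.println(filter.accept("ftp://example.org/"));  // prints false
    }
}
```

Because the class name is just a string, it can be read from a configuration file, which is what makes it possible to swap implementations without recompiling the caller.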
Special Features
Distributed (Hadoop)
Map-Reduce (Diagram)
A framework for distributed programming
Map -- Process the splits of data to get
(Challenging) Modify Nutch such that you can unpack the crawled
files in the segments back into their original state
Reference
http://wiki.apache.org/nutch/PluginCentral -- Information on Nutch plugins
http://lucene.apache.org/hadoop/ -- Hadoop homepage
http://wiki.apache.org/lucene-hadoop/ -- Hadoop Wiki
http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/map -- "MapReduce in Nutch"
http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"
http://www.mail-archive.com/nutch-commits@lucene.apache.org/msg01951.html -- Updated tutorial on setting up Nutch, Hadoop and Lucene together
Excursion: MapReduce
Problem
Find the number of occurrences of “cat” in a
file
What if the file is 20 GB in size?
[Diagram: MapReduce data flow. The input is divided into Split 1 to Split 4; each split is processed by a worker that emits key:value pairs (k1, k2, k3, k4 with values v1 to v6); further workers collect the pairs by key and write Output 1 to Output 3.]