Cloudera, Inc.
220 Portage Avenue
Palo Alto, CA 94306
info@cloudera.com
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com
Important Notice
2010-2013 Cloudera, Inc. All rights reserved.
Cloudera, the Cloudera logo, Cloudera Impala, Impala, and any other product or service names or
slogans contained in this document, except as otherwise disclaimed, are trademarks of Cloudera and its
suppliers or licensors, and may not be copied, imitated or used, in whole or in part, without the prior
written permission of Cloudera or the applicable trademark holder.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. All other
trademarks, registered trademarks, product names and company names or logos mentioned in this
document are the property of their respective owners. Reference to any products, services, processes or
other information, by trade name, trademark, manufacturer, supplier or otherwise does not constitute
or imply endorsement, sponsorship or recommendation thereof by us.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights
under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval
system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or
otherwise), or for any purpose, without the express written permission of Cloudera.
Cloudera may have patents, patent applications, trademarks, copyrights, or other intellectual property
rights covering subject matter in this document. Except as expressly provided in any written license
agreement from Cloudera, the furnishing of this document does not give you any license to these
patents, trademarks, copyrights, or other intellectual property.
The information in this document is subject to change without notice. Cloudera shall not be liable for
any damages resulting from technical errors or omissions which may be present in this document, or
from use of this document.
Version: Cloudera Search Beta, version 0.9.0
Date: June 4, 2013
Contents
ABOUT THIS GUIDE
GUIDELINES FOR DEPLOYING CLOUDERA SEARCH
THE IMPORTANCE OF USE CASE DEFINITION
CLOUDERA SEARCH REQUIREMENTS
CDH REQUIREMENT
OPERATING SYSTEMS
JDK
PORTS USED BY CLOUDERA SEARCH
Ports Used by Cloudera Search
INSTALLING CLOUDERA SEARCH
Will there be an advanced search capability? If so, what does that look like? Can Cloudera
Search count on its users being more motivated than e-commerce users? Significant design
decisions depend on how motivated the users are. That is:
o Can users be expected to take some time to learn about the system? Advanced
screens are usually intimidating to e-commerce users but may be the best choice when
users can be expected to take some time to learn them.
o How patient are your users? In data-mining scenarios, users may be willing to wait
multiple seconds for search results. Of course, you don't want users to wait any longer
than necessary, but there is another set of design decisions related to what counts as a
reasonable response time.
o How many simultaneous users?
Update requirements. An update in Solr refers both to adding new documents and changing
existing documents.
o Loading new documents.
Bulk. Are there use cases where the index has to be rebuilt from scratch? Or will
there be only an initial load?
Incremental. What is the rate of new documents coming into the system?
o Updating documents. Can you characterize the expected number of modifications to
existing documents?
o How much latency is acceptable between the time a document is added to Solr
and the time it is available for search?
Security requirements. Solr has no built-in security options. In Solr, document-level security is
usually best accomplished by indexing some kind of authorization token(s) along with the
document. The number of authorization tokens applied to a document is largely irrelevant;
thousands are reasonable, although such large numbers are usually a nightmare to administer.
The number of authorization tokens associated with a particular user should be much smaller;
a hundred or so is a reasonable upper limit. This is because security at this level is usually
enforced by appending an fq clause to the query, and putting thousands of tokens in an fq
clause is expensive.
o A post filter (also known as a no-cache filter) can help with access schemes that
can't use the fq approach. Post filters are not cached and are applied only after all
the less expensive filters have been applied.
o If grouping and faceting aren't required to accurately reflect the true document counts,
some shortcuts can be taken. For example, ACL filtering is notoriously expensive in some
systems, sometimes requiring database access. If accurate faceting is required, you
cannot stop processing partway through the result list and still report accurate facets.
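As a sketch of the fq-based approach described above, a query restricted to a user's authorization tokens might look like the following. The collection name, the acl field, and the token values are illustrative assumptions, not part of the product:

```shell
# Hypothetical query: return only documents whose indexed "acl" field
# contains at least one of this user's authorization tokens.
# --data-urlencode handles URL encoding of the fq expression.
curl --get 'http://localhost:8983/solr/collection1/select' \
     --data-urlencode 'q=*:*' \
     --data-urlencode 'fq=acl:(sales OR engineering)'
```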
Required query rate, usually measured in queries-per-second.
o Note that you must size the machines to give a reasonable response rate for a single
user. It's possible to put so much strain on a machine that the target hardware cannot
satisfy even a few users, in which case re-sharding is necessary.
o Absent the need to re-shard, increasing Solr's QPS rate is usually a matter of adding more
replicas to each shard.
o Very large numbers of shards can exhibit the laggard issue: as the number of shards
increases, so does the probability that one of them will be anomalously slow. The QPS
rate will generally fall, though very slowly, as the number of shards gets into the hundreds.
Operating Systems
Cloudera Search provides packages for Red-Hat-compatible, SLES, Ubuntu, and Debian systems as
described below.
SLES
SUSE Linux Enterprise Server (SLES) 11 with Service Pack 1 or later, 64-bit
Ubuntu/Debian
Notes
For production environments, 64-bit packages are recommended. Except as noted above,
Cloudera Search provides only 64-bit packages.
Cloudera has received reports that our RPMs work well on Fedora, but we have not tested
this.
If you are using an operating system that is not supported by Cloudera's packages, you can
JDK
Cloudera Search requires Oracle JDK 1.6. Cloudera recommends version 1.6.0_31. The minimum
supported version is 1.6.0_8. See Java Development Kit Installation for more information.
Note
Non-SolrCloud mode has been deprecated and is no longer supported.
solr  Solr/SolrCloud
solr-server  Platform-specific service script for starting, stopping, or restarting Solr.
Important
Running services: When starting, stopping, and restarting CDH components, always use the
service (8) command rather than running /etc/init.d scripts directly. This is
important because service sets the current working directory to the root directory (/)
and removes environment variables except LANG and TERM. This creates a predictable
environment in which to administer the service. If you use /etc/init.d scripts directly,
any environment variables continue to be applied, potentially producing unexpected
results. If you install CDH from packages, service is installed as part of the Linux Standard
Base (LSB).
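For example, assuming the service is named solr-server, matching the package name above, a restart via the service command would look like:

```shell
# Use service(8) so the daemon starts in a clean, predictable
# environment (working directory /, minimal environment variables).
sudo service solr-server restart

# Avoid invoking the init script directly, as in:
#   sudo /etc/init.d/solr-server restart
```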
curl http://repos.jenkins.sf.cloudera.com/solr-beta-nightly/redhat/5/x86_64/search/cloudera-search.repo | sudo tee /etc/yum.repos.d/cloudera-search.repo
sudo yum clean all
You can see that the Cloudera Search packages have been configured to conform to the Linux Filesystem
Hierarchy Standard. (To learn more, run man hier.)
You are now ready to enable the server daemons you want to use with Hadoop. You can also enable
Java-based client access by adding the JAR files in /usr/lib/solr/ and /usr/lib/solr/lib/ to
your Java class path.
Install and start the ZooKeeper service by running the commands shown in the "Installing the ZooKeeper
Server Package and Starting ZooKeeper on a Single Server" section of Installing the ZooKeeper Packages.
SOLR_ZK_ENSEMBLE=<zkhost1>:2181,<zkhost2>:2181,<zkhost3>:2181/solr
SOLR_HDFS_HOME=hdfs://namenodehost:8020/solr
Be sure to replace namenodehost with the hostname of your HDFS NameNode (as specified by
fs.default.name or fs.defaultFS in your conf/core-site.xml file); you may also need
to change the port number from the default (8020). On an HA-enabled cluster, you will need to
ensure that the HDFS URI you use reflects the designated nameservice utilized by your cluster.
This value should be reflected in fs.default.name; instead of a hostname, you would see
hdfs://nameservice1 or something similar.
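For example, on an HA-enabled cluster whose nameservice is nameservice1 (an illustrative name; substitute the nameservice defined for your cluster), the setting in /etc/default/solr would look like:

```shell
# HDFS HA: point Solr at the nameservice URI, not a single NameNode host.
SOLR_HDFS_HOME=hdfs://nameservice1/solr
```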
2. In some cases, such as for configuring Solr to work with HDFS High Availability (HA), you may
want to configure Solr's HDFS client. You can do this by setting the HDFS configuration directory
in /etc/default/solr. Locate the appropriate HDFS configuration directory on each node,
and edit the following property with the absolute path to this directory. Do this on every Solr
Server host:
SOLR_HDFS_CONFIG=/etc/hadoop/conf
Be sure to replace the path with the correct directory containing the proper HDFS configuration
files, core-site.xml and hdfs-site.xml.
b. Make sure that the solr.keytab file is only readable by the solr user.
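A common way to satisfy this keytab-permission requirement is the following, assuming the Solr service runs as the solr user (the account name may differ on your system):

```shell
# Restrict the keytab so only the solr service account can read it.
sudo chown solr:solr /etc/solr/conf/solr.keytab
sudo chmod 400 /etc/solr/conf/solr.keytab
```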
3. Add Kerberos related settings to /etc/default/solr on every node in your cluster, substituting
appropriate values:
SOLR_KERBEROS_ENABLED=true
SOLR_KERBEROS_KEYTAB=/etc/solr/conf/solr.keytab
SOLR_KERBEROS_PRINCIPAL=solr/fully.qualified.domain.name@YOUR-REALM.COM
$ solrctl init
WARNING
solrctl init also accepts a --force option. Running solrctl init --force will
clear the Solr data in ZooKeeper and interfere with any running nodes. If you want to clear Solr data
from ZooKeeper to start over, stop the cluster first.
After you have started the Cloudera Search Server, the Solr server should be up and running. You can
verify that all daemons are running using the jps tool from the Oracle JDK, which you can obtain from
the Java SE Downloads page. If you are running a pseudo-distributed HDFS installation and a Solr search
installation on one machine, jps will show the following output:
$ solrctl --help
Options:
--solr solr_uri
--zk zk_ensemble
--help
--quiet
Commands:
init [--force]
WARNING
If Cloudera Manager is managing the cluster, the --zk option must be specified appropriately.
Configuration files for a collection are managed as part of the instance directory. To generate a skeleton
of the instance directory run:
You can customize it by directly editing the solrconfig.xml and schema.xml files that have been
created in $HOME/solr_configs/conf.
These configuration files are compatible with the standard Solr tutorial example documents.
Once you are satisfied with the configuration, you can make it available for Solr to use via the following
command that will upload the content of the entire instance directory to ZooKeeper:
You may also use the solrctl tool to verify that your instance directory uploaded successfully and is
available via ZooKeeper:
which should return a list of instance directory names. For example, "collection1" in this case.
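The steps described above typically follow the solrctl instancedir workflow; the commands below are a sketch, with the directory path and collection name chosen for illustration:

```shell
# Generate a skeleton instance directory containing solrconfig.xml
# and schema.xml under $HOME/solr_configs/conf.
solrctl instancedir --generate $HOME/solr_configs

# Upload the instance directory contents to ZooKeeper under the
# name "collection1".
solrctl instancedir --create collection1 $HOME/solr_configs

# Verify the upload by listing instance directories in ZooKeeper.
solrctl instancedir --list
```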
Important
If you are familiar with Apache Solr, you may be tempted to configure a collection directly in solr
home: /var/lib/solr. While this is possible, it is discouraged and the use of solrctl is
recommended instead.
4. Verify that the collection is live and that your one shard is being served by two nodes:
http://localhost:8983/solr/#/~cloud
For information on using the Flume Solr Sink, see the Flume Near Real-Time Indexing Reference in the
Cloudera Search User Guide.
For information on using MapReduce to batch index documents see the MapReduce Batch Indexing
Reference in the Cloudera Search User Guide.
Contents
Upgrading Cloudera Search from SolrCloud mode
Upgrading Cloudera Search from Non-SolrCloud mode
Upgrading Cloudera Search involves stopping Cloudera Search services, using your operating system's
package management tool to upgrade Cloudera Search to the latest version, and then restarting
Cloudera Search services.
Requirements
Before attempting any upgrade, it is extremely important to make backup copies of the following
configuration files:
/etc/default/solr
/var/lib/solr/solr.xml
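For example, the backups described above could be made as follows (the .pre-upgrade suffix is an arbitrary choice; adjust paths if your installation uses non-default locations):

```shell
# Copy the Search configuration files aside before upgrading.
sudo cp /etc/default/solr /etc/default/solr.pre-upgrade
sudo cp /var/lib/solr/solr.xml /var/lib/solr/solr.xml.pre-upgrade
```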
2. Upgrade the packages. To upgrade the packages, follow the instructions in the "Installing
Cloudera Search" section of the Installing and Using Cloudera Search guide. Do NOT run yum
update.
4. Upgrade the packages. To upgrade the packages, follow the instructions in the "Installing
Cloudera Search" section of the Installing and Using Cloudera Search guide. Do NOT run yum
update.
$ solrctl init
Requirements
It is extremely important NOT to start the upgraded SolrServer service before completing
this step.
Importing Collections
The following screenshot is an example of the collection import feature within Hue.
Generally, only collections should be imported. Importing cores is rarely useful, since a core only allows
querying a single shard of the index. See A little about SolrCores and Collections for more information.
User UI
The following screenshot is an example of the appearance of the Search application that is integrated
with the Hue user interface.
Customization UI
The following screenshot is an example of the appearance of the Search application customization
interface provided in Hue.
$ cd /usr/share/hue
$ sudo tar -xzvf hue-search-####.tar.gz
$ sudo /usr/share/hue/tools/app_reg/app_reg.py --install /usr/share/hue/apps/search
[search]
## URL of the Solr Server
solr_url=http://SOLR_HOST:8983/solr
Specify the Solr URL in /etc/hue/hue.ini. For example, to use localhost as your Solr host,
you would add the following:
[search]
# URL of the Solr Server; replace 'localhost' if Solr is running on another host
solr_url=http://localhost:8983/solr/
4. (Optional) To view files on HDFS, ensure the correct webhdfs_url is included in hue.ini and
WebHdfs is properly configured as described in Configuring CDH Components for Hue.
5. Restart Hue:
$ cd /usr/share/hue
$ sudo tar -xzvf hue-search-####.tar.gz
$ sudo /usr/share/hue/tools/app_reg/app_reg.py --install /usr/share/hue/apps/search
2. Restart Hue:
Result snippet editor and preview, a download function, extra CSS/JS, labels, and field-picking
assistance.
Show multi-collections.
Show highlighting of search terms.