1) How will you decide whether you need to use the Capacity Scheduler or
the Fair Scheduler?
Fair Scheduling is the process in which resources are assigned to jobs such
that all jobs get an equal share of resources over time. The Fair Scheduler
can be used under the following circumstances -
i) If you want the jobs to make equal progress instead of following the FIFO
order, then you should use Fair Scheduling.
ii) If you have slow connectivity and data locality plays a vital role and makes
a significant difference to the job runtime, then you should use Fair Scheduling.
iii) Use Fair Scheduling if there is a lot of variability in the utilization between
pools.
The Capacity Scheduler can be used under the following circumstances -
i) The Capacity Scheduler's memory-based scheduling method is useful if the jobs
have varying memory requirements.
ii) If you want to enforce resource allocation because you know the cluster
utilization and workload very well, then use the Capacity Scheduler.
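As a rough illustration, on a Hadoop 1.x (MR1) cluster the scheduler is selected on the JobTracker through the mapred.jobtracker.taskScheduler property in mapred-site.xml; the class names below are a sketch for MR1 and assume the corresponding scheduler jar is on the JobTracker classpath, so verify them against your Hadoop version:
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
  <!-- or org.apache.hadoop.mapred.CapacityTaskScheduler for the Capacity Scheduler -->
</property>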
FIFO Scheduler – This scheduler does not consider the heterogeneity in the system
but orders the jobs based on their arrival times in a queue.
COSHH – This scheduler considers the workload, cluster and the user heterogeneity
for scheduling decisions.
Fair Sharing – This Hadoop scheduler defines a pool for each user. The pool contains
a number of map and reduce slots on a resource. Each user can use their own pool
to execute the jobs.
5) List few Hadoop shell commands that are used to perform a copy
operation.
hadoop fs -put
hadoop fs -copyToLocal
hadoop fs -copyFromLocal
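For illustration, assuming a local file named sample.txt and an HDFS directory /user/hadoop (both hypothetical), the commands are typically invoked as follows:
hadoop fs -put sample.txt /user/hadoop/
hadoop fs -copyFromLocal sample.txt /user/hadoop/
hadoop fs -copyToLocal /user/hadoop/sample.txt /tmp/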
The jps command is used to verify whether the daemons that run the Hadoop
cluster are working or not. The output of the jps command lists the daemons
running on the node, such as the NameNode, Secondary NameNode,
DataNode, TaskTracker and JobTracker.
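A quick sketch of what this check looks like on a Hadoop 1.x node (the process IDs below are arbitrary, and the exact daemons listed depend on the node's role):
$ jps
4051 NameNode
4263 SecondaryNameNode
4387 JobTracker
4512 DataNode
4630 TaskTracker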
Memory – The system's memory requirements will vary between the worker services and
management services based on the application.
Operating System – A 64-bit operating system avoids restrictions on the amount of
memory that can be used on worker nodes.
Capacity – Large Form Factor (3.5”) disks cost less and allow you to store more
than Small Form Factor disks.
Only one.
9) What happens when the NameNode on the Hadoop cluster goes down?
10) What is the conf/hadoop-env.sh file and which variable in the file should
be set for Hadoop to work?
This file provides the environment for Hadoop to run and contains the
following variables – HADOOP_CLASSPATH, JAVA_HOME and
HADOOP_LOG_DIR. The JAVA_HOME variable should be set for Hadoop to
run.
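For example, JAVA_HOME can be set in conf/hadoop-env.sh as shown below; the JDK path is only an assumed example and will differ by installation:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle      # assumed path - point this to your JDK
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs        # optional - where daemon logs are written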
11) Apart from using the jps command, is there any other way that you can
check whether the NameNode is working or not?
12) In a MapReduce system, if the HDFS block size is 64 MB and there are
3 files of size 127 MB, 64 KB and 65 MB with FileInputFormat, how many
input splits are likely to be made by the Hadoop framework?
2 splits each for the 127 MB and 65 MB files (127 MB spans a 64 MB block plus a
63 MB block, and 65 MB spans a 64 MB block plus a 1 MB block) and 1 split for
the 64 KB file, giving 5 splits in total.
16) I want to see all the jobs running in a Hadoop cluster. How can you do
this?
The command hadoop job -list gives the list of jobs running in a
Hadoop cluster.
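A brief sketch of the related Hadoop 1.x commands (the job ID is a placeholder):
hadoop job -list                  # jobs currently running
hadoop job -list all              # all jobs, including completed ones
hadoop job -status <job_id>       # details of a specific job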
17) Is it possible to copy files across multiple clusters? If yes, how can you
accomplish this?
Yes, it is possible to copy files across multiple Hadoop clusters and this
can be achieved using distributed copy. The DistCp command is used for
intra-cluster or inter-cluster copying.
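A minimal sketch of an inter-cluster copy, assuming two NameNodes reachable at nn1.example.com and nn2.example.com (the hostnames, port and paths are placeholders):
hadoop distcp hdfs://nn1.example.com:8020/source/path hdfs://nn2.example.com:8020/destination/path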
If the user does not want to compress the data for a particular job, then he
should create his own configuration file and set
the mapred.output.compress property to false. This configuration file should
then be loaded as a resource into the job.
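Alternatively, as a quick sketch, the property can be passed per job on the command line, assuming the job driver uses ToolRunner/GenericOptionsParser (the jar and class names below are placeholders):
hadoop jar my-job.jar com.example.MyJob -D mapred.output.compress=false /input /output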
21) What is the best practice to deploy a secondary NameNode?
23) If Hadoop spawns 100 tasks for a job and one of the tasks fails, what does
Hadoop do?
The task will be started again on a new TaskTracker and if it fails more than
4 times, which is the default setting (the default value can be changed), the
job will be killed.
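The retry limits are controlled by the mapred.map.max.attempts and mapred.reduce.max.attempts properties (default 4 in MR1). A sketch of overriding them for a single job, with placeholder jar and class names:
hadoop jar my-job.jar com.example.MyJob -D mapred.map.max.attempts=8 -D mapred.reduce.max.attempts=8 /input /output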
24) How can you add and remove nodes from the Hadoop cluster?
To add new nodes to the HDFS cluster, the hostnames should be added to the slaves
file and then the DataNode and TaskTracker daemons should be started on the new node.
To remove or decommission nodes from the HDFS cluster, the hostnames should be
added to the exclude file (and removed from the slaves file) and then
hadoop dfsadmin -refreshNodes should be executed.
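A rough sketch of the commands involved on a Hadoop 1.x cluster (hostnames and file locations are assumptions for illustration):
# on the new node, after adding its hostname to conf/slaves on the master:
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker
# to decommission a node, add its hostname to the file referenced by dfs.hosts.exclude, then run:
hadoop dfsadmin -refreshNodes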
25) You increase the replication level but notice that the data is under
replicated. What could have gone wrong?
Nothing has necessarily gone wrong. If there is a huge volume of data,
replication takes time based on the data size, as the cluster has to
copy the data, and it might take a few hours.
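One way to confirm whether blocks are still under-replicated is the HDFS file system check; a sketch:
hadoop fsck /     # the summary reports the number of under-replicated blocks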
26) Explain the different configuration files and where they are located.
The configuration files are located in the “conf” sub-directory. Hadoop has 3
different configuration files – hdfs-site.xml, core-site.xml and mapred-site.xml.
2. How will you install a new component or add a service to an existing Hadoop
cluster?
3. If Hive Metastore service is down, then what will be its impact on the Hadoop
cluster?
4. How will you decide the cluster size when setting up a Hadoop cluster?
5. How can you run Hadoop and real-time processes on the same cluster?
6. If you get a connection refused exception when logging onto a machine in the
cluster, what could be the reason? How will you solve this issue?
8. How can you decide the heap memory limit for a NameNode and Hadoop Service?
9. If the Hadoop services are running slow in a Hadoop cluster, what would be the root
cause for it and how will you identify it?
12. In case of high availability, if the connectivity between the Standby and Active
NameNode is lost, how will this impact the Hadoop cluster?
13. What is the minimum number of ZooKeeper services required in Hadoop 2.0 and
Hadoop 1.0?
14. If the hardware quality of a few machines in a Hadoop cluster is very low, how
will it affect the performance of the job and the overall performance of the cluster?
16. Explain the difference between blacklist node and dead node.
19. After restarting the cluster, if the MapReduce jobs that were working earlier are
failing now, what could have gone wrong while restarting?
20. Explain the steps to add and remove a DataNode from the Hadoop cluster.
21. In a large, busy Hadoop cluster, how can you identify a long-running job?
23. When configuring Hadoop manually, which property file should be modified to
configure slots?
25. What is the advantage of speculative execution? In what situations might
speculative execution not be beneficial?
Open Ended Hadoop Admin Interview Questions
These interview questions are asked on a case-by-case basis, depending on
where you are applying for the role of a Hadoop admin, whether you have prior
experience in the role, and so on. Please do share your Hadoop Admin interview
experience in the comments below.
1. Describe your Hadoop journey along with your roles and responsibilities in your
present project.
2. Which tool have you used in your project for monitoring clusters and nodes in
Hadoop?
6. Explain any notable Hadoop use case by a company that helped maximize its
profitability.
10. Which tool will you prefer to use for monitoring Hadoop and HBase clusters?
The above list just gives an overview of the different types of Hadoop Admin
interview questions that can be asked. However, Hadoop Admin interview
questions can vary widely based on your work experience and the business
domain you come from. Do not worry if you are inexperienced, as companies
would love to hire you if you are clear on your basics and have hands-on
experience working on Hadoop projects. The foremost thing is to prepare well
for a great career in Hadoop Administration, and with that preparation one can
definitely succeed in nailing a Hadoop Admin interview. Strive for excellence
and success will follow.
We would love to answer any questions you have about honing your Hadoop
skills for a lucrative career – please leave a comment below.