For more/latest interview questions visit: http://www.interviewquestionspdf.com/
50 APACHE CASSANDRA INTERVIEW QUESTIONS WITH ANSWERS
FRESHER AND EXPERIENCED
Here are interview questions and answers on Cassandra, a rapidly growing NoSQL
technology, to help you in your interview. These are the 50 most important
interview questions with answers. If you like this set, please comment.
You can also download a PDF; the link is available at the end of the post.
1). What is Apache Cassandra?
Ans: Apache Cassandra is an open source data storage system developed at Facebook
for inbox search and designed for storing and managing large amounts of data across
commodity servers. It can serve as both:
A real-time data store for online applications
A read-intensive database for business intelligence systems
OR
Apache Cassandra is an open source, distributed, and decentralized storage
system (database) for managing very large amounts of structured data spread out
across the world. It provides a highly available service with no single point of failure. It
was developed at Facebook for inbox search and was open-sourced by Facebook in
July 2008.
2). What was the design goal of Cassandra?
Ans: The design goal of Cassandra is to handle big data workloads across multiple
nodes without any single point of failure.
3). What is a NoSQL database?
Ans: A NoSQL database (sometimes read as "Not Only SQL") is a database that
provides a mechanism to store and retrieve data modeled in ways other than the tabular
relations used in relational databases. These databases are typically schema-free,
support easy replication, have simple APIs, are eventually consistent, and can handle
huge amounts of data.
4). Cassandra is written in which language?
Ans: Java
5). How many types of NoSQL databases are there?
Ans: There are four main types:
Document stores (MongoDB, Couchbase)
Key-value stores (Redis, Voldemort)
Column stores (Cassandra)
Graph stores (Neo4j, Giraph)

6). What do you understand by composite type?
Ans: Composite type is a feature of Hector and Cassandra.
It allows you to define a key or a column name as a concatenation of data of different
types.
With CassandraUnit, you can use CompositeType in two places:
row key
column name

7). What is the difference between Cassandra's schema and RDBMS schema?
Ans: http://www.interviewquestionspdf.com/2015/10/cassandra-interviewquestions.html
8). What is the relationship between Apache Hadoop, HBase, Hive and
Cassandra?
Ans: Apache Hadoop: file storage and grid compute processing via MapReduce.
Apache Hive: a SQL-like interface on top of Hadoop.
Apache HBase: column family storage built like Bigtable.
Apache Cassandra: column family storage built like Bigtable, with Dynamo-style
topology and consistency.
9). List some key features of Apache Cassandra.
Ans: It is scalable, fault-tolerant, and tunably consistent.
It is a column-oriented database.
Its distribution design is based on Amazon's Dynamo and its data model on Google's
Bigtable.
Created at Facebook, it differs sharply from relational database management systems.
Cassandra implements a Dynamo-style replication model with no single point of
failure, but adds a more powerful column family data model.
Cassandra is used by some of the biggest companies, such as Facebook, Twitter,
Cisco, Rackspace, eBay, Netflix, and more.
10). What do you understand by Data Replication in Cassandra?
Ans: Database replication is the frequent electronic copying of data from a database on
one computer or server to a database on another so that all users share the same level
of information.
Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. A
replication strategy determines the nodes where replicas are placed. The total number
of replicas across the cluster is referred to as the replication factor. A replication
factor of 1 means that there is only one copy of each row on one node. A replication
factor of 2 means two copies of each row, where each copy is on a different node. All
replicas are equally important; there is no primary or master replica. As a general rule,
the replication factor should not exceed the number of nodes in the cluster. However,
you can increase the replication factor and then add the desired number of nodes later.
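The idea of a replication factor can be sketched in a few lines of Python. This is only an illustration, loosely modelled on how a simple ring-based strategy walks clockwise from the first replica; the function name place_replicas and the node names are hypothetical, and real placement also depends on the partitioner and the snitch.

```python
def place_replicas(ring, primary_index, replication_factor):
    """Return the nodes that hold a row: the first replica plus the next nodes clockwise."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    if replication_factor > len(ring):
        # Mirrors the rule of thumb: do not exceed the number of nodes.
        raise ValueError("replication factor exceeds the number of nodes")
    return [ring[(primary_index + i) % len(ring)] for i in range(replication_factor)]


ring = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical nodes in ring order
print(place_replicas(ring, 2, 1))  # ['node-c']
print(place_replicas(ring, 2, 3))  # ['node-c', 'node-d', 'node-a']
```

With a replication factor of 3 the row lives on three different nodes, and none of them is treated as a master copy.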
11). What do you understand by Node in Cassandra?
Ans: Node is the place where data is stored.
12). What do you understand by Data center in Cassandra?
Ans: Data center is a collection of related nodes.
13). What do you understand by Cluster in Cassandra?
Ans: Cluster is a component that contains one or more data centers.
14). What do you understand by Commit log in Cassandra?
Ans: Commit log is a crash-recovery mechanism in Cassandra. Every write operation
is written to the commit log.
15). What do you understand by Mem-table in Cassandra?
Ans: Mem-table is a memory-resident data structure. After a write is recorded in the
commit log, the data is written to the mem-table. Sometimes a single column family
will have multiple mem-tables.
16). What do you understand by SSTable in Cassandra?
Ans: SSTable is a disk file to which the data is flushed from the mem-table when its
contents reach a threshold value.
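The write path described in questions 14 to 16 (commit log, then mem-table, then flush to an SSTable) can be modelled with a toy class. This is a rough sketch under simplifying assumptions, not Cassandra's implementation: the class TinyStore is hypothetical, the flush threshold counts keys rather than bytes, and the "SSTable" is just an immutable sorted tuple held in memory.

```python
class TinyStore:
    """Toy model of the write path: commit log, then mem-table, then SSTable flush."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only record, used for crash recovery
        self.memtable = {}     # in-memory table holding the latest value per key
        self.sstables = []     # immutable sorted snapshots standing in for disk files
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the write in the commit log
        self.memtable[key] = value            # 2. apply it to the mem-table
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. flush when the threshold is reached

    def flush(self):
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed writes no longer need to be replayed

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # check the newest SSTable first
            for stored_key, value in table:
                if stored_key == key:
                    return value
        return None


store = TinyStore(flush_threshold=2)
store.write("a", 1)
store.write("b", 2)     # second key reaches the threshold and triggers a flush
store.write("a", 3)     # newer value for "a" now sits in the mem-table
print(store.read("a"))  # 3
print(store.read("b"))  # 2
```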
17). What do you understand by Bloom filter in Cassandra?
Ans: Bloom filters are quick, probabilistic algorithms for testing whether an element
is a member of a set: they may report false positives but never false negatives. A Bloom
filter is a special kind of cache. Bloom filters are accessed on every query.
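A minimal Bloom filter can be sketched as follows. This is a generic illustration of the data structure, not Cassandra's own code; the class name, the default sizes, and the use of salted SHA-256 as the hash family are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(item))


bloom = BloomFilter()
bloom.add("user:42")
print(bloom.might_contain("user:42"))  # True
```

A key that was added always reports True. A key that was never added usually reports False, but can occasionally report True, which is why a positive answer still has to be confirmed against the stored data.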
18). What do you understand by CQL?
Ans: Users can access Cassandra through its nodes using the Cassandra Query Language
(CQL). CQL treats the database (keyspace) as a container of tables. Programmers work
with CQL through cqlsh, a command-line prompt, or through separate application
language drivers.
19). What do you understand by Column Family?
Ans: Column family is a container for an ordered collection of rows. Each row, in
turn, is an ordered collection of columns.

20). What is the use of "void close()" method?
Ans: This method is used to close the current session instance.
21). What is the use of "ResultSet execute(Statement statement)" method?
Ans: This method is used to execute a query. It requires a statement object.
22). Which command is used to start the cqlsh prompt?
Ans: cqlsh
23). What is the use of "cqlsh --version" command?
Ans: This command prints the version of cqlsh you are using.
24). What are the collection data types provided by CQL?
Ans: List: an ordered collection of one or more elements.
Map: a collection of key-value pairs.
Set: a collection of one or more unique elements.
25). What is Cassandra database used for?
Ans: Apache Cassandra is a second-generation distributed database originally
open-sourced by Facebook. Its write-optimized, shared-nothing architecture results in
excellent performance and scalability. The Cassandra storage cluster and S3 archival
layer are designed to expand horizontally to any arbitrary size with linear
cost. Cassandra's memory footprint depends more on the number of column
families than on the size of the data set. Cassandra scales well horizontally for
storage and I/O, but not for memory footprint, which is tied to your schema and your
cache settings regardless of the size of your cluster.
26). What is the syntax to create a keyspace in Cassandra?
Ans: The syntax for creating a keyspace in Cassandra is:
CREATE KEYSPACE <identifier> WITH <properties>
27). What is a keyspace in Cassandra?
Ans: In Cassandra, a keyspace is a namespace that determines data replication on
nodes. A cluster typically contains one keyspace per application.
28). What is cqlsh?
Ans: cqlsh is a Python-based command-line client for Cassandra.
29). Does Cassandra work on Windows?
Ans: Yes, Cassandra works well on Windows. At the time of writing, both Linux- and
Windows-compatible versions are available.
30). What do you understand by Consistency in Cassandra?
Ans: Consistency refers to how synchronized and up-to-date a row of Cassandra data
is on all of its replicas.
31). Explain Zero Consistency?
Ans: With zero consistency, write operations are handled in the background,
asynchronously. It is the fastest way to write data, and the one that offers the least
confidence that operations will succeed.
32). What do you understand by Thrift?
Ans: Thrift is the name of the RPC client used to communicate with the Cassandra
server.
33). What do you understand by Kundera?
Ans: Kundera is an object-relational mapping (ORM) implementation for Cassandra
written using Java annotations.
34). What does JMX stand for?
Ans: Java Management Extensions
35). What is the difference between Cassandra, Hadoop Big Data, MongoDB,
CouchDB?
Ans: http://www.interviewquestionspdf.com/2015/10/what-is-difference-betweencassandra.html
36). When to use Cassandra?
Ans: As part of the NoSQL family, Cassandra suits problems where you need a very
write-heavy system and also want a responsive reporting system on top of the stored
data. Consider a web analytics use case where log data is stored for each request and
you want to build an analytics platform around it to count hits by hour, by browser,
by IP, etc. in real time.
37). When should you not use Cassandra? OR When to use RDBMS instead of
Cassandra?
Ans: Cassandra is a NoSQL database and does not provide ACID guarantees or
relational data properties. If you have a strong requirement for ACID properties (for
example, financial data), Cassandra would not be a good fit. You could make it work,
but you would end up writing a lot of application code to handle ACID properties and
would lose badly on time to market. Managing that kind of system with Cassandra
would also be complex and tedious.
38). What are secondary indexes?
Ans: Secondary indexes are indexes built over column values. In other words, let's
say you have a user table that contains each user's email. The primary index would be
the user ID, so if you wanted to access a particular user's email, you could look them
up by their ID. However, solving the inverse query (given an email, fetch the user ID)
requires a secondary index.
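The difference between the two lookups can be shown with plain Python dictionaries. This is a conceptual sketch only: the table contents, the example addresses, and the helper build_index are made up for illustration, and Cassandra's actual secondary indexes are maintained by the database rather than built by hand.

```python
# "Primary index": user ID -> row (sample data, purely illustrative)
users_by_id = {
    1: {"email": "ann@example.com", "town": "Leeds"},
    2: {"email": "bob@example.com", "town": "Leeds"},
}


def build_index(rows, column):
    """Map each value of `column` to the set of row keys that hold it."""
    index = {}
    for key, row in rows.items():
        index.setdefault(row[column], set()).add(key)
    return index


email_index = build_index(users_by_id, "email")

print(users_by_id[1]["email"])         # lookup by primary key: ann@example.com
print(email_index["bob@example.com"])  # inverse lookup via the index: {2}
```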
39). When to use secondary indexes?
Ans: Use a secondary index when you want to query on a column that isn't the primary
key and isn't part of a composite key, and the column has few unique values. For
example, a Town column is a good choice for secondary indexing because lots of people
will be from the same town; date of birth, however, would not be such a good choice.
40). When to avoid secondary indexes?
Ans: Avoid secondary indexes on columns that contain a high count of unique
values, where queries will produce few results.
41). I have a row or key cache hit rate of 0.XX123456789 reported by JMX. Is
that XX% or 0.XX% ?
Ans: XX%
42). What happens to existing data in my cluster when I add new nodes?
Ans: When a new node joins a cluster, it automatically contacts the other nodes in
the cluster and copies the right data to itself.
43). What are "Seed Nodes" in Cassandra?
Ans: A seed node in Cassandra is a node that is contacted by other nodes when they
first start up and join the cluster. A cluster can have multiple seed nodes. Seed nodes
help bootstrap a new node joining a cluster. It is recommended to use two seed nodes
per data center.
44). What are "Coordinator Nodes" in Cassandra?
Ans: A coordinator node is the node that receives a request from a client and forwards
it to the node that actually owns the data, which is determined by the token
[hash(key) => token]. Any node can act as a coordinator, because every node can
receive a request and proxy it.
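The hash(key) => token step can be sketched as a lookup on a sorted token ring. This is an illustrative model, not Cassandra's routing code: the function names, the 32-bit ring size, the node names, and the use of SHA-256 as a stand-in for the partitioner's hash are all assumptions.

```python
import bisect
import hashlib


def token_for(key, ring_size=2**32):
    """Hash a partition key to a token (SHA-256 is a stand-in for the partitioner)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % ring_size


def owner_of(key, ring):
    """ring is a sorted list of (token, node); a node owns tokens up to and including its own."""
    tokens = [token for token, _ in ring]
    index = bisect.bisect_left(tokens, token_for(key))
    return ring[index % len(ring)][1]  # wrap to the first node past the last token


# Hypothetical four-node ring with evenly spaced tokens.
ring = [
    (2**30, "node-a"),
    (2**31, "node-b"),
    (3 * 2**30, "node-c"),
    (2**32 - 1, "node-d"),
]

# Whichever node receives the request acts as coordinator and proxies it to the owner.
print(owner_of("user:42", ring))
```

The same key always maps to the same owner, no matter which node the client happens to contact first.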

45). What are the benefits of NoSQL over relational databases?
Ans: NoSQL addresses areas that the relational data model does not handle
well, which are as follows:
Huge volumes of structured, semi-structured, and unstructured data
Flexible data model (schema) that is easy to change
Scalability and performance for web-scale applications
Lower cost
Impedance mismatch between the relational data model and object-oriented
programming
Built-in replication
Support for agile software development

46). What ports does Cassandra use?
Ans: By default, Cassandra uses 7000 for cluster communication, 9160 for clients
(Thrift), and 8080 for JMX (7199 from Cassandra 0.8 onward). These are all editable in
the configuration file or bin/cassandra.in.sh (for JVM options). All ports are TCP.
47). What do you understand by High availability?
Ans: A high availability system is one that is ready to serve any request at any
time. High availability is usually achieved by adding redundancy, so if one part
fails, another part of the system can serve the request. To a client, it seems as if
everything worked fine.
48). How does Cassandra provide high availability?
Ans: Cassandra is robust software. Nodes joining and leaving are handled
automatically. With proper settings, Cassandra can be made failure resistant, meaning
that if some of the servers fail, no data is lost. So you can deploy Cassandra over
cheap commodity hardware or a cloud environment, where hardware or infrastructure
failures may occur.

49). Who uses Cassandra?
Ans: Cassandra is in wide use around the world, and usage is growing all the time.
Companies like Netflix, eBay, Twitter, Reddit, and Ooyala all use Cassandra to power
pieces of their architecture, and it is critical to the day-to-day operations of those
organizations. To date, the largest publicly known Cassandra cluster by machine count
has over 300TB of data spanning 400 machines.
Because of Cassandra's ability to handle high-volume data, it works well for a myriad
of applications. This means that it's well suited to handling projects from the
high-speed world of advertising technology in real time to the high-volume world of
big-data analytics and everything in between. It is important to know your use case
before moving forward to ensure things like proper deployment and good schema design.
52). What do you understand by Snitches?
Ans: A snitch determines which data centers and racks nodes belong to. Snitches inform
Cassandra about the network topology so that requests are routed efficiently, and they
allow Cassandra to distribute replicas by grouping machines into data centers and
racks. Specifically, the replication strategy places the replicas based on the
information provided by the snitch. When a cluster switches to a new snitch, all nodes
must still map to the same rack and data center. Cassandra does its best not to have
more than one replica on the same rack.
53). What is Hector?
Ans: Hector is an open source project written in Java and released under the MIT
license. It was one of the early Cassandra clients and is used in production at
Outbrain. It wraps Thrift and offers JMX, connection pooling, and failover.
54). What do you understand by NoSQL CAP theorem?
Ans: The CAP theorem concerns three properties:
Consistency: data is the same across the cluster, so you can read or
write to/from any node and get the same data.
Availability: the ability to access the cluster even if a node in the cluster
goes down.
Partition tolerance: the cluster continues to function even if there is
a "partition" (communications break) between two nodes (both nodes are up,
but can't communicate).
In order to get both availability and partition tolerance, you have to give up
consistency. Consider two nodes, X and Y, in a master-master setup.
Now there is a break in network comms between X and Y, so they can't sync
updates. At this point you can either:
A) Allow the nodes to get out of sync (giving up consistency), or
B) Consider the cluster to be "down" (giving up availability)
All the combinations available are:
CA - data is consistent between all nodes, as long as all nodes are online, and
you can read/write from any node and be sure that the data is the same, but if
you ever develop a partition between nodes, the data will be out of sync (and
won't re-sync once the partition is resolved).
CP - data is consistent between all nodes, and maintains partition tolerance
(preventing data desync) by becoming unavailable when a node goes down.
AP - nodes remain online even if they can't communicate with each other and
will resync data once the partition is resolved, but you aren't guaranteed that all
nodes will have the same data (either during or after the partition).

55). What is Keyspace in Cassandra?
Ans: Before doing any work with the tables in Cassandra, we have to create a
container for them, otherwise known as a keyspace. One of the main uses for
keyspaces is defining a replication mechanism for a group of tables.
Example:
CREATE KEYSPACE used_cars WITH replication = { 'class': 'SimpleStrategy',
'replication_factor' : 1};
56). Explain Cassandra data model?
Ans: The Cassandra data model has four main concepts: cluster, keyspace,
column, and column family.
Clusters contain many nodes (machines) and can contain multiple keyspaces.
A keyspace is a namespace to group multiple column families, typically one per
application.
A column contains a name, value, and timestamp.
A column family contains multiple columns referenced by row keys.
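The nesting of these four concepts can be pictured with ordinary Python structures. This is a conceptual sketch, not how Cassandra stores data: the column family name "listings", the row key, and the helper put are hypothetical, and the rule that a newer timestamp wins is included only to show why a column carries a timestamp.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    value: str
    timestamp: int  # write time; the newest write for a column name wins


# cluster -> keyspace -> column family -> row key -> {column name: Column}
cluster = {
    "used_cars": {          # keyspace
        "listings": {       # hypothetical column family
            "row-1": {
                "make": Column("make", "Volvo", 1),
                "year": Column("year", "2012", 1),
            }
        }
    }
}


def put(column_family, row_key, column):
    """Insert a column into a row, keeping the existing column if it is newer."""
    row = column_family.setdefault(row_key, {})
    current = row.get(column.name)
    if current is None or column.timestamp >= current.timestamp:
        row[column.name] = column


listings = cluster["used_cars"]["listings"]
put(listings, "row-1", Column("make", "Saab", 0))  # older write is ignored
print(listings["row-1"]["make"].value)             # Volvo
```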