- Service Delivery Program
- About Us
This blog aims to deep-dive into Azure Cassandra MI, which became GA in Nov 2020. I have mostly referred to Datastax, Apache, and Microsoft’s official web pages to produce this content. If you feel any information is missing, please connect with me.
Today’s business is driven mainly by new ideas and innovative solutions that rely on information and intelligence gathered by an unending stream of Data flowing from various sources. Organizations can customize their solutions according to customer needs and markets. However, not always is this data in a structured manner. This is where NoSQL databases come into focus. There are many options available in the market for NoSQL databases. Apache Cassandra is one of the most sought-after non-relational databases today. Cassandra is a highly available and scalable distributed No-SQL databases system with no single point of failure. Unlike SQL databases, Cassandra clusters are schema-free and designed to handle large amounts of incoming data and hundreds or thousands of writes per second.
Facebook developed Cassandra to handle their Inbox search feature. It became open-sourced in 2008 until Apache Incubator accepted it in 2009. Since early 2010, Apache has been a top project due to its growing popularity.
Image source: Tutorialspoint.com
A Cassandra ring consists of homogeneous nodes connected in a peer-to-peer model. The schema in Cassandra is not fixed, and the queries performed on the cluster are based on the Partition+ Clustering key combination. Therefore, it doesn’t support Cross Partition Transactions, Distributed Joints, and Foreign Keys.
A Cassandra cluster can be deployed on a single VM, multiple VMs, Kubernetes, or the cloud (AWS, Azure). The main reason behind Cassandra’s distributed architecture is that a hardware failure can and does occur frequently. All distributed storage system works on the principle of the CAP theorem. CAP theorem states that any distributed system can powerfully deliver any two out of the three properties: Consistency, Availability, and Partition-tolerance. However, in the case of Cassandra, it maintains a wonderful balance between consistency, availability, and to an extent, Partition Tolerance.
Cassandra is referred to as an “eventually consistent” system, which means that all the nodes will have the same data set at no point in time. However, the consistency level can be tuned for reading and writing operations by a database client. Here, the “Consistency Level (CL)” can be defined as the response from the number of replicas before a read/write operation is considered successful.
Commonly used CLs: One, Two, Three, All, Quorum, Local, Each.
Multiple copies of data are written to nodes on the cluster. Partition Tolerance:
Regarding Partition Tolerance, the Cassandra cluster continues to function even if a node or a group of nodes goes down.
The main components of a Cassandra cluster include the following:
Internode communication takes place with the help of the “Gossip” protocol. Nodes constantly share state and location information about themselves and up to three other nodes. Moreover, the state information of each node is persisted locally to enable access to this information quickly by a node in case it restarts.
Image source: Cassandra.apache.org
When one of the replica nodes becomes unavailable, the write operations on this node go unsuccessful. In this situation, a coordinator in the cluster creates a “Hinted Handoff” and persists it locally for nearly 3 hours. Once the node comes back up and starts gossiping with other nodes, hints are played to this node until it is in sync with the other nodes.
For Read operations, if there is a mismatch in the data residing on the read replicas, a Read Repair is performed to ensure an UpToDate data is presented to the customer.
The incoming data into a Cassandra Cluster needs to be directed to one of the nodes for processing. The Partition key is taken care of in this task, also known as the Primary Key. Each node has an equal part of the Partition key hash value. A partition key is converted into a hash value by a Partitioner, issuing tokens and determining which node will be responsible for which token range. Murmur3Partitioner is the default Partitioner for Cassandra. Each node is assigned tokens that are signed integer values between -2^63 to +2^63-1.
For example, if there are 4 nodes in the cluster, the token range is between 0-99, node1 gets 0-24 key hash value, node 2 gets 25-49, etc.
Data is replicated across nodes for high availability. “Replication Factor (RF)” determines the total number of replicas maintained across the cluster in each keyspace. E.g., if RF is set to two at a datacenter level, two copies of data are maintained on two different nodes. Caution should be taken while setting the RF. It should never be less than one and never be more than the number of nodes in the cluster. RF can be set at the Keyspace level as well.
One copy of data resides on the owner node with a token range for that data. The remaining replicas are placed on other nodes in the cluster clockwise depending on the set Replication strategy.
Two main replication strategies are available:
Along with the Replication Strategy, a “Snitch” helps route the requests efficiently as they keep an account of the network topology of the Cassandra cluster.
Writes in Cassandra Reads in Cassandra
Image source: baeldung.com
A coordinator (one of the Cassandra nodes) acts as a broker to facilitate read/write processes in a cluster for both reads and writes.
Both actions depend mainly on Consistency Level, Replication Factor, and the network topology configured per cluster. Cassandra is designed to be write-intensive and compared to other no-sql databases; the write performance is outstanding in its architectural design.
Apache Cassandra’s structure is “built to scale” and handles massive amounts of data. Every corporation stores huge amounts of data across various systems and requires providing control and access to users to this data; Apache Cassandra makes this a reality without a single point of failure, very less downtime, and significantly improved uptime. Due its varied benefits major companies are utilizing Apache Cassandra to leverage their business requirements.
This blog is a 3-part series for Apache Cassandra’s No-SQL database. The first part gives the reader a brief overview of Open-Source Cassandra. The subsequent parts will deal with Azure Managed Instance for Apache Cassandra features, Use cases, Installation, Connectivity, and Migration options.
We here at CloudThat are the official AWS (Amazon Web Services) Advanced Consulting Partner and Training partner and Microsoft gold partner, helping people develop knowledge on cloud and help their businesses aim for higher goals using best in industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.
Drop a query if you have any questions regarding Apache Cassandra utilization, and I will get back to you quickly. To get started, go through our Expert Advisory page and Managed Services Package that is CloudThat’s offerings.
Saritha is a Subject Matter Expert - Kubernetes working as a lead at CloudThat. She has relentlessly kept upskilling with the latest trending technologies and delivering the best cloud-native solutions to Businesses in all sectors. She is a Microsoft Certified Solution Architect, Certified Windows Server Administrator, and Windows Networking professional and strives towards providing the best cloud experience to our customers through transparent communication, methodical approach, and diligence.
Our support doesn't end here. We have monthly newsletters, study guides, practice questions, and more to assist you in upgrading your cloud career. Subscribe to get them all!