NoSQL
In the previous lesson, we covered SQL databases. In this lesson, we look at NoSQL databases, which are widely used for storing data that is not relational. We cover the properties of NoSQL databases and their use cases. This lesson will help you answer questions like:
Why do we use NoSQL?
What guarantees and constraints come with NoSQL?
What problems can arise with NoSQL?
What should we consider when choosing a NoSQL database?
How do we scale it?
What is NoSQL
NoSQL databases, also known as Not Only SQL or non-SQL databases, are non-tabular. They don't store data in relational tables; instead they store it as documents, wide-column stores, key-value pairs, or graphs. This doesn't mean that NoSQL databases can't maintain relationships between data; they just model relationships differently. These databases also support CRUD operations.
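To make "model relationships differently" concrete, here is a minimal sketch in plain Python dictionaries. The user and order data, and the helper `orders_for`, are hypothetical and are not tied to any particular database. It shows the two common options in a document model: embedding related data inside one document, or referencing it by id and resolving the reference in application code.

```python
# Hypothetical data: one user with two orders, modeled two ways.

# 1) Embedding: the related orders live inside the user document.
user_embedded = {
    "_id": "u1",
    "name": "Asha",
    "orders": [
        {"order_id": "o1", "total": 40},
        {"order_id": "o2", "total": 15},
    ],
}

# 2) Referencing: orders are separate documents that point back to the user.
users = {"u1": {"_id": "u1", "name": "Asha"}}
orders = [
    {"order_id": "o1", "user_id": "u1", "total": 40},
    {"order_id": "o2", "user_id": "u1", "total": 15},
]


def orders_for(user_id):
    """Application-side lookup that stands in for a SQL join."""
    return [o for o in orders if o["user_id"] == user_id]


embedded_total = sum(o["total"] for o in user_embedded["orders"])
referenced_total = sum(o["total"] for o in orders_for("u1"))
print(embedded_total, referenced_total)  # 55 55
```

Embedding keeps everything needed for a read in one place, while referencing avoids duplicating data at the cost of an extra lookup.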
Properties of NoSQL
Flexible and dynamic schema
Horizontally scalable: being distributed in nature, these databases make it easy to scale data across machines.
Horizontal Scalability: Horizontal scalability is the ability of a system to increase its capacity by adding more machines, creating a cluster of machine resources. This lets you run low-cost tasks in parallel on commodity hardware and still get a faster, more scalable system. Horizontal scaling also improves concurrency and reduces downtime (availability).
Availability: NoSQL databases primarily emphasize availability, meaning the database remains accessible for queries even if that requires compromising on consistency.
Availability: Availability is a property of a system that measures the degree to which it is operable and usable. High availability means that the system can be used most of the time and has minimal downtime. Availability can also differ between reads and writes; for example, a system can have high read availability but low write availability.
Eventual consistency: NoSQL databases provide eventual consistency, that is, once a write completes, its changes eventually reach all nodes/machines. Until then, a query might return stale data.
Eventual Consistency: Eventual consistency is the tradeoff for high availability. It guarantees that data modifications, once completed, will eventually be reflected on all machines in the cluster. It is used when availability is preferred and some inaccuracy in reads can be tolerated. Propagation to the other machines usually takes seconds or minutes, depending on the system. An example is a user updating their city on a social media platform: for a short while, other users may still see the old city.
NoSQL databases don't usually support ACID.
NoSQL databases are typically fast for both reads and writes.
Also, they support thousands of users concurrently.
Different NoSQL databases are built for different purposes. For example, MongoDB is a document database that is excellent with unstructured data and fast on simple queries, while Apache Cassandra is a wide-column store built for fast reads and writes but poor for updates or deletes.
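The eventual consistency property listed above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the classes `Replica` and `EventuallyConsistentStore` are hypothetical, replication is triggered manually by `propagate()`, and real systems replicate in the background with conflict handling that is not shown here.

```python
class Replica:
    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: a write lands on one replica and reaches the others later."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # (replica_index, key, value) not yet applied

    def write(self, key, value):
        # Acknowledge after updating replica 0; queue the update for the rest.
        self.replicas[0].data[key] = value
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        # Stand-in for background replication: apply queued updates in order.
        for i, key, value in self.pending:
            self.replicas[i].data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("user:42:city", "Pune")
store.propagate()

store.write("user:42:city", "Delhi")
stale = store.read("user:42:city", replica_index=2)  # replica 2 has not caught up
store.propagate()
fresh = store.read("user:42:city", replica_index=2)
print(stale, fresh)  # Pune Delhi
```

The read between the second write and the next `propagate()` call returns the old city, which is the stale read the property describes.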
Examples of NoSQL
MongoDB: Document store with flexible schema
Apache Cassandra: Wide column store with fast reads and writes
Apache HBase: Wide column store for large datasets providing consistency
Redis: In-memory key value store, based on master-slave architecture
Memcached: In-memory key value store, for smaller data with a flat hierarchy
Quick Q&A
What would you do when a certain machine in your NoSQL cluster becomes a bottleneck due to the number of writes?
This kind of problem is known as a hot partition. You can change the partitioning scheme to distribute the load more evenly across machines, for example by hashing on a different entity or changing the hash function. Another possible solution is to add machines and use consistent hashing. We will talk about hashing in future chapters.
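As a preview of the consistent hashing idea mentioned in this answer, here is a minimal sketch. The class `HashRing`, the node names, and the choice of MD5 with 100 virtual nodes per machine are all assumptions made for illustration, not a description of how any specific database does it. The point it shows is that adding a machine moves only some keys, and all of them move to the new machine.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Any stable hash works here; MD5 is used only for its even spread.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node is placed at several positions (virtual nodes).
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key):
        # First position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"user-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved to the new node")
```

With plain `hash(key) % number_of_nodes`, changing the node count would remap most keys; the ring limits movement to the keys the new node takes over.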
What would you do when a certain machine in your NoSQL cluster becomes a bottleneck due to the number of reads?
You can increase the replication of the affected partitions and spread the read traffic across slaves. A master-slave architecture sends all write requests to the master and read requests to the slaves. Because NoSQL databases provide eventual consistency, writes to the master are propagated to the slaves after a short delay.
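The routing described in this answer can be sketched as follows. The class `ReplicatedPartition` is hypothetical, replication is triggered manually, and reads are spread round-robin; a real cluster would replicate continuously and might route reads by latency or load instead.

```python
class ReplicatedPartition:
    """Toy master-slave partition: writes go to the master, reads to slaves."""

    def __init__(self, slave_count=2):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]
        self.reads_served = [0] * slave_count
        self._turn = 0

    def write(self, key, value):
        self.master[key] = value

    def replicate(self):
        # Stand-in for asynchronous propagation from master to slaves.
        for slave in self.slaves:
            slave.update(self.master)

    def add_slave(self):
        # A new slave starts from a copy of the master's current data.
        self.slaves.append(dict(self.master))
        self.reads_served.append(0)

    def read(self, key):
        # Round-robin across slaves so no single one takes all reads.
        i = self._turn % len(self.slaves)
        self._turn += 1
        self.reads_served[i] += 1
        return self.slaves[i].get(key)


partition = ReplicatedPartition(slave_count=2)
partition.write("place:7", "Cafe")
before_sync = partition.read("place:7")  # None: the slave has not caught up yet

partition.replicate()
partition.add_slave()  # more replicas means each one serves fewer reads
results = [partition.read("place:7") for _ in range(6)]
print(before_sync, results, partition.reads_served)
```

The first read returns nothing because the write has not reached the slave yet, which is the eventual consistency cost of offloading reads from the master.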
How do you decide the data partitioning scheme in your NoSQL cluster?
Several partitioning schemes can be used, including key-based, range-based, and hash-based partitioning. For example, if you are storing user information, you can hash the user id to distribute the data across machines. We will talk more about hashing in upcoming chapters.
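A minimal sketch of two of these schemes, hash-based and range-based, is shown below. The cluster size `NUM_NODES`, the range boundaries, and the function names are assumptions chosen for illustration.

```python
import bisect
import hashlib

NUM_NODES = 4  # hypothetical cluster size


def hash_partition(user_id: str) -> int:
    """Hash-based: spread keys evenly, with no ordering between neighbours."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES


# Range-based: each node owns a contiguous slice of the key space.
# Node 0 holds names below "g", node 1 below "n", node 2 below "t", node 3 the rest.
RANGE_UPPER_BOUNDS = ["g", "n", "t"]


def range_partition(username: str) -> int:
    """Range-based: keeps sorted keys together, which helps range scans."""
    return bisect.bisect_right(RANGE_UPPER_BOUNDS, username.lower())


for name in ["alice", "Maya", "zed"]:
    print(name, "hash ->", hash_partition(name), "range ->", range_partition(name))
```

Hash-based partitioning tends to balance load better, while range-based partitioning keeps adjacent keys on the same machine and can therefore develop hot ranges.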
What would you do in scenarios where the data to be queried is stored across multiple nodes in a NoSQL cluster?
Let's say we are trying to find places near your location, and the database is hashed on placeId. In this case, we need to query every machine for its closest places. Once each machine has returned its results, we aggregate them in a central place for further use. This is a very common architecture pattern: aggregation is needed because the data is stored across machines.
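This pattern is often called scatter-gather. The sketch below is a simplified illustration: the shard contents, the function names, and the use of flat x/y coordinates instead of latitude and longitude are all assumptions. Each shard returns its own k closest places, and a coordinator merges the partial results.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: places are hashed by placeId, so nearby places are scattered.
SHARDS = [
    [("p1", 0.0, 1.0), ("p4", 5.0, 5.0)],
    [("p2", 0.5, 0.5), ("p5", 9.0, 9.0)],
    [("p3", 2.0, 2.0), ("p6", 0.1, 0.2)],
]


def distance(x, y, place):
    _, px, py = place
    return math.hypot(px - x, py - y)


def nearest_on_shard(shard, x, y, k):
    """Scatter step: each machine finds its own k closest places."""
    return heapq.nsmallest(k, shard, key=lambda p: distance(x, y, p))


def nearest_places(x, y, k):
    # Query all shards in parallel, then gather the partial results.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda s: nearest_on_shard(s, x, y, k), SHARDS)
        candidates = [place for partial in partials for place in partial]
    # Gather step: pick the overall k closest among the candidates.
    best = heapq.nsmallest(k, candidates, key=lambda p: distance(x, y, p))
    return [place_id for place_id, _, _ in best]


print(nearest_places(0.0, 0.0, k=3))  # ['p6', 'p2', 'p1']
```

Because every shard returns at most k candidates, the coordinator merges a small list regardless of how much data each machine holds.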
Would you store the data for a hot topic on a single machine or partition it across multiple machines?
In the previous question, we could instead have stored the places data on a single machine by hashing on earth grid information (say, all places in a 10 km x 10 km cell are stored on one machine). This makes querying nearby places fast, since all relevant data is on a single machine. However, if the number of read requests to this machine increases significantly (say, everyone in a crowded area uses this service), the machine becomes a bottleneck. The aggregation approach from the previous question stays fast because the machines are queried in parallel, and the aggregation step does not create a bottleneck.
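The grid-based alternative and its hot spot risk can be sketched as follows. The 10 km cell size comes from the example above; the node count, the cell-to-node formula, and the sample request coordinates (in kilometres on a flat plane) are assumptions made for illustration.

```python
import math
from collections import Counter

CELL_KM = 10   # cell size from the example above
NUM_NODES = 4  # hypothetical cluster size


def grid_cell(x_km: float, y_km: float) -> tuple:
    """Map a position to the 10 km x 10 km cell that contains it."""
    return (math.floor(x_km / CELL_KM), math.floor(y_km / CELL_KM))


def node_for_cell(cell) -> int:
    # Simple deterministic cell-to-machine mapping, used only for this sketch.
    col, row = cell
    return (col * 31 + row) % NUM_NODES


# Three requests come from the same crowded cell, one from elsewhere.
requests = [(1.0, 2.0), (3.5, 9.9), (8.0, 0.1), (25.0, 14.0)]
load = Counter(node_for_cell(grid_cell(x, y)) for x, y in requests)
print(dict(load))  # {0: 3, 3: 1}
```

All requests from the crowded cell land on one machine, so lookups there need no aggregation but that machine carries the whole load for the area.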