NoSQL
In the previous lesson, we covered SQL databases. In this lesson, we look at NoSQL databases, which are widely used for storing data that is not relational. We cover the properties of NoSQL databases and their use cases. This lesson will help you answer questions like:
Why do we use NoSQL?
What guarantees and constraints come with NoSQL?
What problems can be associated with NoSQL?
What should we consider when choosing a NoSQL database?
How do we scale it?
What is NoSQL
NoSQL, also known as "Not Only SQL" or "Not SQL", refers to non-tabular databases. NoSQL databases don't store data as relational tables; instead, they store it as documents, wide-column stores, key-value pairs, or graphs. This doesn't mean that NoSQL databases can't maintain relationships between data; they model relationships differently. These databases also support CRUD operations.
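To make the four storage models concrete, here is a minimal sketch that writes the same fact ("user u1, named Ada, follows user u2") in each shape using plain Python dictionaries. The field names, key formats, and the `followed_by_u1` helper are made up for illustration and do not reflect the API of any particular database.

```python
# Document: one self-contained, nested record per entity.
document = {"_id": "u1", "name": "Ada", "follows": ["u2"]}

# Key-value: an opaque value looked up by a single key.
key_value = {"user:u1:name": "Ada", "user:u1:follows": "u2"}

# Wide-column: row key -> column family -> columns; rows may differ in columns.
wide_column = {"u1": {"profile": {"name": "Ada"}, "follows": {"u2": True}}}

# Graph: nodes plus explicit edges that carry the relationship.
graph = {
    "nodes": {"u1": {"name": "Ada"}, "u2": {}},
    "edges": [("u1", "FOLLOWS", "u2")],
}


def followed_by_u1():
    """Read the same relationship back out of each model."""
    return {
        "document": document["follows"],
        "key_value": key_value["user:u1:follows"].split(","),
        "wide_column": list(wide_column["u1"]["follows"]),
        "graph": [dst for src, rel, dst in graph["edges"]
                  if src == "u1" and rel == "FOLLOWS"],
    }
```

Each model can answer "whom does u1 follow?", but the relationship lives in a different place: inside the record, in a key naming convention, in a column name, or in an explicit edge.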
Properties of NoSQL
Flexible and dynamic schema
Horizontally scalable: distributed in nature, these databases allow easy scaling of data across machines.
Availability: NoSQL databases primarily emphasize availability, meaning the database stays accessible for queries even if that requires compromising on consistency.
Eventual consistency: NoSQL databases provide eventual consistency, that is, once a write completes, the change reaches all nodes/machines eventually rather than immediately. A query in the meantime might return stale data.
NoSQL databases don't usually support ACID transactions.
NoSQL databases are typically fast for both reads and writes.
They also support thousands of concurrent users.
Different NoSQL databases are built for different purposes. For example, MongoDB is a document database that handles unstructured data well and is fast on simple queries, while Apache Cassandra is a wide-column store built for fast reads and writes but poorly suited to updates or deletes.
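The eventual consistency property listed above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `EventuallyConsistentStore` class, its `propagate` step, and the `cart:42` key are hypothetical, and real databases replicate in the background rather than on an explicit call.

```python
class EventuallyConsistentStore:
    """Toy model: a write lands on one node and reaches the others later."""

    def __init__(self, node_count=3):
        self.nodes = [{} for _ in range(node_count)]
        self.pending = []  # (node_index, key, value) updates not yet applied

    def write(self, key, value):
        # The write is acknowledged as soon as the first node has it.
        self.nodes[0][key] = value
        for i in range(1, len(self.nodes)):
            self.pending.append((i, key, value))

    def read(self, key, node_index):
        return self.nodes[node_index].get(key)

    def propagate(self):
        # Stand-in for background replication catching up.
        for i, key, value in self.pending:
            self.nodes[i][key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("cart:42", ["book"])
stale = store.read("cart:42", node_index=2)  # replica has not caught up yet
store.propagate()
fresh = store.read("cart:42", node_index=2)  # replica now has the write
```

Between `write` and `propagate`, a read served by another node misses the update; this window is the stale read that the property describes.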
Examples of NoSQL
MongoDB: Document store with flexible schema
Apache Cassandra: Wide-column store with fast reads and writes
Apache HBase: Wide-column store for large datasets, providing consistency
Redis: In-memory key-value store, based on master-slave architecture
Memcached: In-memory key-value store, for smaller data with a flat hierarchy
Quick Q&A
What would you do when a certain machine in your NoSQL cluster becomes a bottleneck due to the number of writes?
This kind of problem is known as a hot partition. You can change the partitioning scheme to distribute the load across machines, for example by changing the hashing key to another entity or by changing the hashing function. Another possible solution is to add machines and use consistent hashing. We will talk about hashing in later chapters.
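As a preview of the consistent hashing idea mentioned above, here is a minimal sketch of a hash ring with virtual nodes. The `HashRing` class, the node names, and the choice of MD5 with 100 virtual points per machine are assumptions made for this example, not the scheme of any specific database.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Stable hash; Python's built-in hash() for strings varies between runs."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes  # virtual points per machine, to even out load
        self._ring = []       # sorted list of (point, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (ring_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # First point at or after the key's hash, wrapping around at the end.
        idx = bisect.bisect_left(self._ring, (ring_hash(key),))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # add a machine to relieve the others
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
```

Only some of the keys change owner when `node-d` joins, and every key that moves lands on the new machine. With plain `hash(key) % machine_count`, adding a machine would instead remap most keys.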
What would you do when a certain machine in your NoSQL cluster becomes a bottleneck due to the number of reads?
You can increase the replication of the hot partitions and spread the read traffic across slaves. A master-slave architecture directs all write requests to the master and read requests to the slaves. Because NoSQL architecture provides eventual consistency, writes to the master are propagated to the slaves after a delay, so a slave may briefly serve stale data.
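The read/write split described above could be sketched as a small router. The `ReplicaRouter` class, the node names, and the round-robin policy are assumptions for illustration; real systems may pick a slave by load or by network distance instead.

```python
import itertools
from collections import Counter


class ReplicaRouter:
    """Send writes to the master and rotate reads across the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)  # simple round-robin

    def route(self, operation):
        if operation == "write":
            return self.master
        return next(self._slaves)


router = ReplicaRouter("master", ["slave-1", "slave-2", "slave-3"])
read_load = Counter(router.route("read") for _ in range(9))
write_target = router.route("write")
```

Nine reads end up as three per slave, so adding slaves lowers the read load on each one, while every write still goes to the single master.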
How do you decide the data partitioning scheme in your NoSQL cluster?
There are various partitioning schemes to choose from, such as key-based, range-based, and hash-based partitioning. Say you are saving user information; you can then hash on the user ID to distribute the records across machines. We will talk more about hashing in later chapters.
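To compare two of these schemes, here is a minimal sketch that places user records by hash and by range. The node list, the letter boundaries, and both function names are hypothetical, and user IDs are assumed to be non-empty strings.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical machines


def hash_partition(user_id: str) -> str:
    """Hash-based: spreads IDs evenly, but neighbouring IDs are scattered."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def range_partition(user_id: str) -> str:
    """Range-based: keeps neighbouring IDs together, but ranges can get hot."""
    first = user_id[0].lower()
    if first <= "h":
        return NODES[0]
    if first <= "p":
        return NODES[1]
    return NODES[2]
```

For example, `range_partition("alice")` and `range_partition("adam")` both return `node-0`, which suits range scans but concentrates load if many IDs start with the same letters. `hash_partition` gives no such locality in exchange for a more even spread.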
What would you do in scenarios where the data to be queried is stored across multiple nodes in the NoSQL cluster?
Let's say we are trying to find places near your location, and the database is hashed on placeId. In this case, we need to query every machine for its closest places. Once the results are gathered from each machine, we have to aggregate them in a central place for further use. This is a very common architecture pattern: because the data is spread across machines, a query fans out to all of them and the results are aggregated.
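This fan-out-and-aggregate pattern can be sketched with in-memory lists standing in for machines. The `SHARDS` data, the flat kilometre coordinates, and the function names are assumptions for the example; a real deployment would issue network calls to each machine.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor

# Each shard holds the places whose placeId hashed to it: (place_id, x_km, y_km).
SHARDS = [
    [("p1", 0.0, 1.0), ("p4", 9.0, 9.0)],
    [("p2", 2.0, 2.0), ("p5", 0.5, 0.0)],
    [("p3", 5.0, 5.0)],
]


def nearest_on_shard(shard, origin, k):
    """Local step: each machine returns only its own k closest places."""
    scored = ((math.dist(origin, (x, y)), pid) for pid, x, y in shard)
    return heapq.nsmallest(k, scored)


def nearest_places(origin, k):
    """Scatter the query to every shard in parallel, then merge centrally."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda s: nearest_on_shard(s, origin, k), SHARDS)
        candidates = [hit for partial in partials for hit in partial]
    return [pid for _, pid in heapq.nsmallest(k, candidates)]
```

Calling `nearest_places((0.0, 0.0), 2)` returns `["p5", "p1"]`. Each shard sends back at most `k` candidates, so the central merge handles `k` times the number of shards rather than the whole dataset.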
Which would you choose for a hot topic: storing its data on a single machine or partitioning it across multiple machines?
In the above question, we could have saved the places data on a single machine by hashing on earth-grid information (say, all places in a 10 km × 10 km cell are stored on one machine). This helps when querying places near your location, since all the relevant data is on a single machine, which makes the computation fast. However, if the number of read requests to this machine increases significantly (say, everyone in a crowded area uses this service), the machine becomes a bottleneck. The aggregation approach discussed in the previous question stays fast in that case because the machines are queried in parallel, so the read load is spread out rather than concentrated on one machine.
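The grid-based placement and its hot-spot risk can be sketched as follows. The flat kilometre coordinates, the cell-to-shard formula, and the sample request list are all assumptions for illustration; real systems usually derive the cell from latitude and longitude.

```python
from collections import Counter

CELL_KM = 10      # cell size from the example above
SHARD_COUNT = 4   # hypothetical number of machines


def cell_key(x_km, y_km):
    """Map a position to the 10 km x 10 km grid cell that contains it."""
    return (int(x_km // CELL_KM), int(y_km // CELL_KM))


def shard_for(x_km, y_km):
    """All places (and queries) in one cell land on the same machine."""
    cx, cy = cell_key(x_km, y_km)
    return (cx * 31 + cy) % SHARD_COUNT


# Four requests from the same crowded cell.
downtown_requests = [(3.2, 7.9), (9.9, 0.1), (5.0, 5.0), (1.1, 8.4)]
load = Counter(shard_for(x, y) for x, y in downtown_requests)
```

All four requests fall in cell `(0, 0)`, so `load` shows a single machine serving every one of them. That locality is what makes a nearby-places query fast, and it is also what turns a crowded cell into a bottleneck.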