Non-relational database systems are another distinctive offering in the field of data-intensive
computing. With this database model, modern non-relational data-sets can be processed
more efficiently than with traditional relational database systems. High-performance
computing environments such as cloud computing systems depend heavily
on non-relational database systems for efficient storage and retrieval of data. This section
explains the need for such databases, discusses their storage architecture and briefly describes
a few popular ones.
14.5.1 Emergence of Large Volume of Unstructured Data-sets
Earlier, the data managed by enterprise applications were structured in nature and small in volume.
With the introduction of web-based portals towards the end of the last century, the nature of
web content began to change. The volume of data started to grow exponentially and the data
became unstructured in character. Such data-sets were later classified and their characteristics
identified. This type of data is referred to as ‘Big data’.
14.5.1.1 Big data
Big data describes both structured and unstructured data that is massive in volume.
The term also covers data that are too diverse in nature or too dynamic (very fast-changing).
Put differently, data whose volume, velocity or variety is too great to be handled by traditional
database systems are termed Big data. These three characteristics are described below.
Volume: A typical PC probably had about 10 gigabytes of storage in the year 2000. At
that time, excessive data volume was a storage issue because storage was not as cheap
as it is today. Today, social networking sites generate a few thousand terabytes of data
every day.
Velocity: Data now stream in at an unprecedented rate,
so they must be dealt with in a timely manner. Responding quickly to customers’ actions is
a business challenge for any organization.
Variety: Data of all formats are important today. Structured and unstructured text, audio,
video, images, 3D data and other formats are all being produced every day.
These characteristics bring variability and complexity to the management of big data.
Variability means that the data flow can be highly inconsistent, with periodic peaks, as happens
in social media or on e-commerce portals. Complexity arises when it becomes
difficult to connect and correlate data or to define their hierarchies.
Note that big data is not only about the volume of data; it also covers other
characteristics of modern data, such as their variety and speed of generation.
14.5.2 Time Appeared for an Alternative Database Model
Enterprise applications have used relational databases since the 1980s.
These relational database systems were developed to store and process structured data-sets.
They began to face challenges as the volume of data started increasing
exponentially towards the end of the last century, and the situation worsened after the introduction
of web-based social networking and e-commerce portals. Soon, the concept of big data emerged.
Online Transaction Processing (OLTP) applications flooded the web with very high volumes
of data from the beginning of the current century. These applications had to function under
strict latency constraints and provide consistent performance to a very large number of users, as
hundreds of millions of clients throughout the world were accessing them. These
sites also experienced massive variations in traffic. Some of the spikes were due to
predictable events such as New Year, a business release or a sporting event, but most were
caused by unpredictable events, which are more difficult to manage. Data were being accessed
more frequently and needed to be processed more intensively.
Relational databases are appropriate for a wide range of tasks, but not for every task.
The basic operations on any database are read and write. Read operations can be scaled by
distributing and replicating data across multiple servers, but the replicas may become inconsistent
when a write or update operation takes place. With modern data, the number of writers
is often much larger than the number of readers, especially on popular social networking sites.
One solution to this problem is to partition the data exclusively during distribution, so that
each item is written on one server only. Even then,
distributed unions (of data from database tables) may become slower and harder
to implement if the underlying storage architecture does not support them.
The main problem was that traditional SQL databases built on relational systems
do not scale well. Traditional DBMSs can only ‘scale up’ (vertical scaling), that is, increase the
resources of a central server. Efficient processing of big data, however, requires an excellent
‘scale out’ (horizontal scaling) capability.
Web applications were moving towards the cloud computing model, and it did not take long
for the pioneers of cloud computing services such as Google and Amazon, many other e-commerce
and social networking companies, and technologists to realize that traditional relational
databases were no longer sufficient for handling modern data. They started to look for a suitable
database solution.
Traditional SQL databases do not fit well with the concept of horizontal scaling, yet horizontal
scaling is the only way to scale a database indefinitely.
14.5.2.1 Modern Age Database Requirements
Horizontal scaling emerged as one of the necessary attributes of a database system that must
keep pace with the processing needs of large data-sets. It appeared impossible to deliver high
performance without distributing the data among multiple nodes and processing them in parallel.
The other major concern was the latency associated with transactions. This latency could be
reduced by caching frequently used data in memory on dedicated servers instead of fetching them
from storage every time they are required. These facilities had to be incorporated in the new
database systems to reduce response time and enhance the performance of applications. The
databases also had to be highly optimized for simple retrieval and append operations. These
needs, along with many other issues, were the driving forces behind the development of an
alternative database system.
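The latency argument above can be illustrated with a small read-through cache. This is only a sketch under stated assumptions: the class name ReadThroughCache, the plain dictionary standing in for the slow backing database, and the least-recently-used eviction policy are all hypothetical choices made for illustration, not features of any particular product.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Tiny in-memory LRU cache placed in front of a slower backing store."""

    def __init__(self, backing_store, capacity=2):
        self.backing_store = backing_store  # any dict-like object (assumed)
        self.capacity = capacity
        self.entries = OrderedDict()        # insertion order tracks recency
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = self.backing_store[key]     # the slow path: go to the database
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used key
        return value


# A plain dict stands in for the slow database in this sketch.
database = {"user:1": "Asha", "user:2": "Ben", "user:3": "Chen"}
cache = ReadThroughCache(database, capacity=2)
for key in ["user:1", "user:1", "user:2", "user:3", "user:1"]:
    cache.get(key)
print(cache.hits, cache.misses)
```

Repeated requests for the same key are served from memory; only the first request, or a request for an evicted key, reaches the backing store.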
14.5.2.2 Role of Cloud Storage System
The characteristics of storage systems had also changed during this time. The earlier concern
about the cost of storage space receded, and the cost of storage management gradually became the
dominant element of storage systems. This opened the opportunity to replicate files into
storage across different geographic locations, and hence the use of distributed file systems
became widespread.
In this scenario, the evolution of storage strategy introduced many different
models of distributed file systems, such as General Parallel File System (GPFS), Google File System
(GFS) and Hadoop Distributed File System (HDFS). All of these work well in high-performance
computing environments. The characteristics of such file systems and their storage
strategy suited the cloud’s dynamic architecture well. This created the opportunity to develop
scalable database systems (over these file systems) to store and manage modern data.
Cloud native databases are facilitated by distributed storage systems and the two are closely
associated with one another. Hence, the storage system and the database system often overlap.
14.5.3 NoSQL DBMS
NoSQL is a class of database management systems that do not follow all of the rules of a
relational DBMS. The term NoSQL can be interpreted as ‘Not Only SQL’, as it is not a replacement
for RDBMS but a complementary addition to it. Databases of this class may offer SQL-like
query languages but do not use traditional SQL (Structured Query
Language).
The term NoSQL was coined by Carlo Strozzi in 1998 as the name of a file-based
open-source relational database he was developing, which did not have an SQL interface.
However, this initial usage of the term is not directly linked with NoSQL as the term is
used at present. The term drew attention in 2009 when Eric Evans (an employee of the cloud
hosting company Rackspace) used it at a conference to describe the surge in development of
non-relational distributed databases at that time.
NoSQL is not against SQL; it was developed to handle unstructured big data efficiently
and so provide maximum business value.
14.5.3.1 The Evolution
The NoSQL movement started slowly in the early years of the current century as the IT industry
began to realize the need for a new kind of database system to support web-based applications.
The initial advances gained ground when computing majors Google and Amazon published two
papers in succession, in 2006 and 2007.
14.5.3.2 The BigTable Revolution
In 2004, Google set up a team to develop a storage system for managing Big data. BigTable is
the outcome of that effort. It is a proprietary distributed storage system built by Google on GFS
and has been in use since 2005. The storage system was built to manage large structured data-sets
and was designed to scale to a very large size. It is structured as a large table which may be
petabytes in size and distributed among tens of thousands of machines. BigTable has successfully
provided a flexible, high-performance solution for Google products such as Google Earth, Google
Analytics and Orkut.
BigTable had a large impact on NoSQL database design after Google publicly
disclosed its details in a technical paper in 2006. This opened the way for technologists
to develop open-source BigTable-like databases. Thus HBase, developed
by the Apache Software Foundation, and Cassandra, developed at Facebook, appeared.
Meanwhile, in 2007, Amazon published a paper on its Dynamo
storage system, which was also built to address the challenges of working with big data.
BigTable, although built as a storage system, resembles a database system in many ways. It also
shares many implementation strategies with database technologies.
The NoSQL database development process remained closely associated with developments
in the field of cloud native file systems (or cloud storage systems) during those days. Soon, many
other web service providers started working on the technology, and within a short period,
starting around 2008, these developments became the source of a technology
revolution. NoSQL databases became prominent after 2009, when the general term
‘NoSQL’ was adopted to set these new databases (or, more correctly, file systems) apart.
NoSQL database development has been closely associated with scalable file system
development in computing.
14.5.3.3 CAP Theorem
The abbreviation CAP stands for Consistency, Availability and Partition tolerance.
The CAP theorem (also known as Brewer’s theorem) says that it is impossible for a distributed
computer system to provide all three of these guarantees simultaneously. Eric Brewer of the
University of California, Berkeley presented the idea at an ACM (Association for Computing
Machinery) conference in 2000.
■ Consistency: The data in the database remain consistent after the execution of an
operation. For example, once a data item is written or updated, all future read requests will
see that data.
■ Availability: The database always remains available, without any downtime.
■ Partition tolerance: The database is partitioned in such a way that if one part of
it becomes unavailable, the other parts remain unaffected and can function properly.
This ensures availability of information.
Any distributed database system must follow this ‘two-of-three’ philosophy. Thus, the relational
database, which focuses strongly on consistency, sacrifices the ‘partition tolerance’ attribute of
CAP (Figure 14.1). As already discussed, one of the primary goals of NoSQL systems is to
boost horizontal scalability. To scale horizontally, a system needs strong network partition
tolerance, which requires giving up either the ‘consistency’ or the ‘availability’ attribute of CAP.
Thus, every NoSQL database follows either the CP (consistency-partition tolerance) or the
AP (availability-partition tolerance) combination of CAP attributes. This means that some
NoSQL databases even drop consistency as an essential attribute. For example,
HBase follows the CP criteria, while the other popular database, Cassandra, follows the AP criteria.
Some NoSQL databases choose to relax the ‘consistency’ requirement of the CAP
criteria, and this philosophy suits certain distributed applications well.
Different combinations of CAP criteria serve different kinds of requirements. Database
designers analyze the specific data processing requirements before choosing one.
CA: Suitable for systems designed to run on a cluster at a single site, so that all
of the nodes always remain in contact. Hence, the worry about network partitioning
almost disappears. But if a partition does occur, the system fails.
CP: This model tolerates network partitioning and is suitable for systems where
24 × 7 availability is not critical. Some data may become inaccessible for a while, but
the rest remain consistent and accurate.
AP: This model also tolerates network partitioning, as partitions are designed
to work independently. 24 × 7 availability of data is assured, but some of the
data returned may occasionally be stale or inaccurate.
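The difference between the CP and AP choices can be made concrete with a toy model of two replicas separated by a network partition. This is a deliberately simplified sketch, not how any real database implements replication; the class names and the "CP"/"AP" mode strings are hypothetical.

```python
class Replica:
    """One copy of a single data item."""

    def __init__(self):
        self.value = "v1"


class TwoNodeStore:
    """Two replicas of one item; mode is 'CP' or 'AP' (illustrative labels)."""

    def __init__(self, mode):
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True when a and b cannot communicate

    def write(self, value):
        """A client writes through replica a."""
        if self.partitioned:
            if self.mode == "CP":
                # CP: refuse the write rather than let the replicas diverge.
                raise RuntimeError("unavailable: cannot reach replica b")
            # AP: accept the write locally; replica b is now stale.
            self.a.value = value
            return
        self.a.value = value
        self.b.value = value

    def read_from_b(self):
        """Another client reads through replica b."""
        return self.b.value


cp_store = TwoNodeStore("CP")
ap_store = TwoNodeStore("AP")
cp_store.partitioned = ap_store.partitioned = True

try:
    cp_store.write("v2")
except RuntimeError as error:
    print("CP store:", error)

ap_store.write("v2")
print("AP store: replica a =", ap_store.a.value, "replica b =", ap_store.read_from_b())
```

During the partition, the CP store sacrifices availability (the write is rejected, so both replicas still agree), while the AP store stays available and returns an out-of-date value from the replica that missed the update.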
Partition tolerance is an essential criterion for NoSQL databases, as one of their primary goals
is horizontal scalability.
14.5.3.4 BASE Theorem
Relational database systems treat consistency and availability as essential criteria.
These criteria are fulfilled by following the ACID (Atomicity, Consistency,
Isolation and Durability) properties in RDBMS. NoSQL databases tackle the consistency issue
differently. They are less stringent about consistency and focus instead on partition
tolerance and availability. Hence, NoSQL databases no longer need to follow the ACID rules.
A NoSQL database should be much easier to scale out (horizontal scaling) and capable of
handling large volumes of unstructured data. To achieve this, NoSQL databases usually follow the
BASE principle, which stands for ‘Basically Available, Soft state, Eventual consistency’.
BASE was also defined by Eric Brewer, who is known for formulating the CAP theorem.
The three criteria of BASE are explained below:
■ Basically Available: Data should remain available even in the
presence of multiple node failures. This is achieved by using a highly distributed approach
with multiple replicas in the database management system.
■ Eventual Consistency: Immediately after an operation, data may appear
inconsistent, but they should ultimately converge to a consistent state. For
example, two users querying the same data immediately after a transaction (on that data)
may get different values, but consistency is eventually regained.
■ Soft State: The eventual consistency model allows the database to be inconsistent for some
time. To bring it back to a consistent state, the system must allow its state to change over
time even without any input. This is known as the soft state of the system.
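The three BASE criteria above can be sketched with two replicas that are reconciled by a background synchronization pass. This is a minimal illustration under stated assumptions: the version numbers, the "highest version wins" rule and the anti_entropy function are hypothetical simplifications, and real systems use more elaborate conflict resolution.

```python
class Replica:
    """One copy of the data; every value carries a version number."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        current = self.data.get(key)
        # Keep the entry only if it is newer than what this replica holds.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def anti_entropy(replicas):
    """Background sync: offer every entry to every replica; highest version wins."""
    for source in replicas:
        for key, (version, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, value, version)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("cart", ["book"], version=1)
anti_entropy([r1, r2])                         # both replicas now agree

r2.write("cart", ["book", "pen"], version=2)   # update reaches r2 only
print(r1.read("cart"), r2.read("cart"))        # two readers may see different values

anti_entropy([r1, r2])                         # state changes without client input
print(r1.read("cart"), r2.read("cart"))        # replicas have converged
```

Both replicas answer reads at all times (basically available), they disagree briefly after the update (eventual consistency), and the stored state of r1 changes during the sync pass although no client wrote to it (soft state).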
BASE does not guarantee immediate consistency. The AP region of Figure 14.1 follows the BASE
approach. The idea is that data consistency is the application developer’s problem and
should be handled by the developer through appropriate programming techniques; the database
no longer handles it. This philosophy helps to achieve the scalability goal.
To satisfy the scalability and data distribution demands of NoSQL, it was no longer possible to
meet all four criteria of ACID simultaneously. Hence, BASE was proposed as an
alternative.
14.5.4 Features of NoSQL Database
NoSQL databases introduce many features that relational databases lack. A few
of these features run counter to relational DBMS concepts. They can be listed as schema-free,
non-relational, horizontally scalable and distributed.
14.5.4.1 Flexible Schemas
Relational database systems cannot handle data whose structure is not known in advance. The
schema of the database and its tables must be defined before any data are stored. With this
schema-based design, it becomes difficult to manage agile data sets. If, in the middle of
business operations, a new field (column) has to be introduced in some table, the change is
extremely disruptive because it requires alteration of the schema. This is a very slow process
and can involve significant downtime.
NoSQL databases are designed to allow insertion of data without a pre-defined schema. This
makes it easy to incorporate changes in an application in real time, as and when required,
without causing service interruption.
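A minimal sketch of schema-free insertion follows, using plain Python dictionaries as stand-ins for stored records. The collection list, the find helper and the field names (such as loyalty_tier) are hypothetical and exist only for this illustration.

```python
# Records are plain dictionaries; no table definition exists anywhere.
collection = []

collection.append({"_id": 1, "name": "Asha", "email": "asha@example.com"})

# Later the business needs a new field. Newer records simply carry it;
# older records are left untouched and nothing has to be altered or migrated.
collection.append(
    {"_id": 2, "name": "Ben", "email": "ben@example.com", "loyalty_tier": "gold"}
)


def find(records, **criteria):
    """Return the records whose fields match every given criterion."""
    return [
        record
        for record in records
        if all(record.get(field) == wanted for field, wanted in criteria.items())
    ]


gold_customers = find(collection, loyalty_tier="gold")
print([record["name"] for record in gold_customers])
```

Records without the new field are simply skipped by queries on that field, which is the behaviour that lets an application evolve without a schema change.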
Unlike a relational database, a NoSQL database is schema-free.
14.5.4.2 Non-relational
A NoSQL database can manage non-relational data efficiently, along with relational data. The
relational constraints of RDBMS do not apply to it, which makes
non-relational data easier to manage.
14.5.4.3 Scalability
Relational databases are designed to scale vertically. But vertical scaling has its
limitations, as it does not allow new servers to be introduced into the system to share the load.
Horizontal scaling is the only way to scale indefinitely, and it is also cheaper than vertical
scaling. NoSQL databases are designed to scale horizontally with minimum effort.
14.5.4.4 Auto-distribution
Distributed relational databases allow fragmentation and distribution of a database across
multiple servers. But this does not happen automatically; it is a manual process that must be
handled by the application, which makes it difficult to manage. In NoSQL databases, on the other
hand, distribution happens automatically. Application developers need not worry about it:
all distribution and load-balancing activities are automated within the database itself.
Distribution and replication of data segments are not inherent features of relational databases;
they are the responsibility of application developers. In NoSQL, they happen automatically.
14.5.4.5 Auto-replication
Like fragmentation and distribution, replication of database fragments is an
automatic process in NoSQL. No external programming is required to replicate fragments
across multiple servers. Replication ensures high availability of data and supports recovery.
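Automatic distribution and replication can be sketched together in a few lines. The example below is an assumption-laden illustration: it places each key on a node chosen by hashing the key and copies it to the next node as well. The Cluster class, the modulo placement rule and the copies parameter are hypothetical; production systems typically use more sophisticated schemes such as consistent hashing.

```python
import hashlib


def primary_node(key, num_nodes):
    """Pick a node for a key from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class Cluster:
    """Toy cluster that distributes and replicates keys without caller involvement."""

    def __init__(self, num_nodes=4, copies=2):
        self.nodes = [dict() for _ in range(num_nodes)]
        self.copies = copies
        self.down = set()  # ids of nodes that are currently unreachable

    def _owners(self, key):
        first = primary_node(key, len(self.nodes))
        return [(first + i) % len(self.nodes) for i in range(self.copies)]

    def put(self, key, value):
        for node_id in self._owners(key):
            self.nodes[node_id][key] = value

    def get(self, key):
        for node_id in self._owners(key):
            if node_id not in self.down:
                return self.nodes[node_id].get(key)
        raise RuntimeError("all replicas for this key are down")


cluster = Cluster(num_nodes=4, copies=2)
for i in range(50):
    cluster.put(f"user:{i}", i)

# Simulate the failure of the node that holds the primary copy of one key.
cluster.down.add(primary_node("user:7", 4))
print(cluster.get("user:7"))  # still served from the second copy
```

The caller only uses put and get; which node stores which fragment, and where the copies live, is decided inside the store, and a single node failure does not make the data unavailable.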
14.5.4.6 Integrated Caching
NoSQL databases often provide an integrated caching capability. This feature reduces latency
and increases throughput by keeping frequently used data in system memory as much as
possible. With a relational database, a separate caching layer has to be maintained to achieve
this performance goal.
It must be added that, although NoSQL databases offer many advantages
over relational databases, in some specific scenarios they fail to provide the rich reporting and
analytical functionality of an RDBMS.
Despite its many benefits, NoSQL in specific cases fails to provide the rich analytical
functionality that RDBMS offers.
14.5.5 NoSQL Database Types
There are four main types of NoSQL databases. Each is designed to address the needs of
a particular class of problems, and NoSQL database providers offer solutions
for different types of problems. The following sections describe the four types.
14.5.5.1 Key-Value Database
The Key-Value Database (or KV Store) is the simplest of the NoSQL database types.
It pairs each data value with a key and maintains the database like a hash table in which values
are referenced by their keys. The main benefit of such pairing is that it makes the database
easily scalable. However, it is not suitable where queries are based on the value rather than on
the key. Amazon’s DynamoDB and Azure Table Storage are popular examples of this type of
NoSQL database.
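The hash-table analogy can be shown directly. The sketch below wraps a Python dictionary in a hypothetical KeyValueStore class; the method names are illustrative and do not correspond to the API of any product named above.

```python
class KeyValueStore:
    """Minimal key-value store backed by a hash table (a Python dict)."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        self._table[key] = value

    def get(self, key, default=None):
        # Lookup by key is a single hash-table access.
        return self._table.get(key, default)

    def delete(self, key):
        self._table.pop(key, None)

    def find_keys_by_value(self, predicate):
        # A query on the value has no index to use: every entry is inspected.
        return [key for key, value in self._table.items() if predicate(value)]


store = KeyValueStore()
store.put("session:ab12", {"user": "asha", "items": 3})
store.put("session:cd34", {"user": "ben", "items": 0})

print(store.get("session:ab12"))
print(store.find_keys_by_value(lambda session: session["items"] > 0))
```

The store treats each value as opaque, which keeps key lookups cheap and partitioning by key simple, but it is also why value-based queries degrade into a full scan.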
14.5.5.2 Document-Oriented Database
A Document-Oriented Database (or Document Store) stores data in
documents. It is similar to a Key-Value store, with the values stored as structured documents.
Each document is addressed by, and can be retrieved from the database using, a key. This key
can be a path, a URI or a simple string.
The documents are schema-free and can be in any format, as long as the database application
can understand their internal structure. Document-oriented databases generally use
XML, JSON (JavaScript Object Notation) or Binary JSON (BSON) formats.
One document can be referenced by multiple keys, and a document can refer to other
documents by storing their keys. But each document is treated as stand-alone and there is no
constraint to enforce relational integrity. MongoDB, Apache CouchDB and Couchbase are
popular examples of document-oriented databases.
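The following sketch illustrates documents addressed by path-like keys, stored as JSON, and referring to one another by key. The documents dictionary, the save and load helpers and the key names are hypothetical and serve only to illustrate the description above.

```python
import json

# The store maps a key (here a path-like string) to a JSON document.
documents = {}


def save(key, document):
    documents[key] = json.dumps(document)


def load(key):
    return json.loads(documents[key])


save("authors/42", {"name": "R. Roy", "country": "IN"})
save(
    "books/1",
    {"title": "Cloud Basics", "author_key": "authors/42", "tags": ["cloud", "intro"]},
)

# A document refers to another document simply by storing its key.
book = load("books/1")
author = load(book["author_key"])
print(book["title"], "by", author["name"])

# No constraint enforces integrity: a reference to a missing document is accepted.
save("books/2", {"title": "Orphan", "author_key": "authors/999"})
print("authors/999" in documents)
```

Unlike a plain key-value store, the stored value has visible internal structure (fields, nested lists), but following or validating references is left to the application.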
14.5.5.3 Column-Family Database
A Column-Family Database (or Wide-Column Data Store/Column Store) stores data grouped
in columns. Each column consists of three elements: a name, a value and a time-stamp. The name
is used to refer to the column, and the time-stamp identifies the required version of the content.
For example, the time-stamp is useful in finding the most up-to-date content. Columns of a
similar type, which are often accessed together, form a column family. A column family can
contain a virtually unlimited number of columns.
In relational databases, each row is stored as a contiguous disk entry. Different rows may
be stored in different places on disk. In a column-family database, by contrast, all of the
cells corresponding to a column are stored as a contiguous disk entry. This makes access
to the data of a column faster. For example, searching for a particular title among a million
book records stored in the relational data model is an intensive task, as it may cause millions of
disk accesses. With the column-family data model, on the other hand, the titles are stored
together and can be scanned with far fewer disk accesses, in the best case a single one.
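The difference between the two layouts can be sketched by counting how many stored cells a title search has to touch. This is a rough in-memory analogy only: the cells_read counter is a hypothetical stand-in for disk accesses, and the three-book data set is invented for the illustration.

```python
FIELDS = ("id", "title", "year")
books = [(1, "Dune", 1965), (2, "Emma", 1815), (3, "Ulysses", 1922)]

# Row layout: the cells of each record are stored next to each other.
row_store = [value for book in books for value in book]

# Column layout: all cells of one column are stored next to each other.
column_store = {
    name: [book[index] for book in books] for index, name in enumerate(FIELDS)
}


def find_title_row_store(title):
    """Scan record by record; every cell of each visited record is read."""
    cells_read = 0
    for start in range(0, len(row_store), len(FIELDS)):
        record = row_store[start:start + len(FIELDS)]
        cells_read += len(record)
        if record[1] == title:
            return record[0], cells_read
    return None, cells_read


def find_title_column_store(title):
    """Scan only the title column, then fetch the matching id."""
    cells_read = 0
    for position, candidate in enumerate(column_store["title"]):
        cells_read += 1
        if candidate == title:
            return column_store["id"][position], cells_read + 1
    return None, cells_read


print(find_title_row_store("Ulysses"))
print(find_title_column_store("Ulysses"))
```

With only three columns the saving is modest, but it grows with the width of the record, because the column layout never reads the fields that the query does not mention.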
The difference between column stores and key-value stores is that column stores are
optimized to handle data along columns. Column stores show better analytical power and
provide improved performance by imposing a certain amount of rigidity on the database schema.
In some ways, column stores are an intermediate solution between traditional RDBMSs and
key-value stores. Hadoop’s HBase is a popular example of a column-store database.
14.5.5.4 Graph Database
In a Graph Database (or Graph Store), data are stored as graph structures using nodes and
edges. Entities are represented as nodes and the relationships between entities as edges.
A graph database follows index-free adjacency, in which every node points directly to its adjacent
nodes. In this set-up, the cost of a hop remains the same as the number of nodes increases.
This is useful for storing information about relationships when the number of elements is huge,
as with social connections. Twitter uses such a database to track who is following whom.
Examples of popular graph databases include Neo4j, InfoGrid and InfiniteGraph.
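Index-free adjacency can be sketched with nodes that hold direct references to their neighbours. The Node class, the follows attribute and the sample names are hypothetical; the sketch only illustrates why a hop does not depend on the total number of nodes.

```python
class Node:
    """An entity that keeps direct references to its adjacent nodes."""

    def __init__(self, name):
        self.name = name
        self.follows = []  # edges: direct references, no global index lookup

    def follow(self, other):
        self.follows.append(other)


def follows_of_follows(node):
    """Names reachable in exactly two hops that the node does not follow yet."""
    suggestions = set()
    for neighbour in node.follows:        # first hop: follow a reference
        for second in neighbour.follows:  # second hop: follow another reference
            if second is not node and second not in node.follows:
                suggestions.add(second.name)
    return suggestions


alice, bob, carol, dave = (Node(name) for name in ("alice", "bob", "carol", "dave"))
alice.follow(bob)
alice.follow(dave)
bob.follow(carol)
bob.follow(alice)
dave.follow(carol)

print(follows_of_follows(alice))
```

Each hop only touches the neighbour list of the current node, so the work of the traversal depends on how many edges are followed, not on how many nodes the whole graph contains.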
14.5.6 Selecting the Suitable NoSQL Database Solution
Each NoSQL database type has its own strengths and weaknesses. The types are designed to
serve different kinds of data storage requirements and hence are not directly comparable with
each other. The ‘one-size-fits-all’ philosophy of relational databases does not apply in the NoSQL
domain. Here, users have the flexibility of choosing among multiple options after analyzing the
requirements of their applications.
Selecting a NoSQL database strategy is not a one-time decision. First, one has
to identify the requirements of an application that are not met by relational database
systems. Then a suitable NoSQL database solution has to be identified to meet those unfulfilled
requirements. More than one NoSQL database type may even be used to meet all of these
needs.
Sometimes a single application performs best when more than one
NoSQL database type is employed together. In such cases, a multi-model NoSQL database can be
used, which is designed to support several of the four primary NoSQL data models.
The days when one DBMS was used to fit all needs are over. Now, a single application may use
several different data stores at the back-end.
14.5.7 Commercial NoSQL Databases
Commercial NoSQL databases started surfacing about two years after the publication of Google’s
paper on BigTable in 2006 and Amazon’s paper on Dynamo in 2007. After these publications,
many initiatives were taken up for both open-source and closed-source development of NoSQL
databases. By the end of 2009, there were several releases, including the BigTable-inspired HBase
and the Dynamo-inspired Riak and Cassandra. The following sections briefly describe some of the
popular NoSQL databases.
14.5.7.1 Apache’s HBase
HBase is an open-source NoSQL database system written in Java. It was developed by the Apache
Software Foundation as part of its Hadoop project. HBase’s architecture was
inspired by Google’s internal storage system BigTable. Just as Google’s BigTable uses GFS,
Hadoop’s HBase uses HDFS as its underlying file system. HBase is a column-oriented database
management system.
14.5.7.2 Amazon’s DynamoDB
DynamoDB is a key-value NoSQL database developed by Amazon. It was launched in 2012 and
derives its name from Dynamo, Amazon’s internal storage system. The database
service is fully managed by Amazon and offered as part of the Amazon Web Services portfolio.
DynamoDB is particularly useful for supporting a large volume of concurrent updates and is well
suited to shopping-cart-like operations.
14.5.7.3 Apache’s Cassandra
Cassandra is an open-source NoSQL database management system written in Java. It was
initially developed at Facebook and was released as an open-source project in 2008 with
the goal of further advancement. Although Facebook’s kingdom was largely dependent on
Cassandra, the company still released it as an open-source project, possibly assured that it was
too late for others to use the technology to knock its castle down. Cassandra became an Apache
Incubator project in 2009. Cassandra is a hybrid of column-oriented and key-value data stores
and is suitable for deployment both across many commodity servers and on cloud infrastructure.
14.5.7.4 Google Cloud Datastore
Cloud Datastore is developed by Google and is available as a fully managed NoSQL database
service. It is easy to use and supports SQL-like queries in a language called GQL.
The Datastore is a NoSQL key-value database in which users store data as key-value pairs.
Cloud Datastore also supports ACID transactions using optimistic concurrency control.
14.5.7.5 MongoDB
MongoDB is a popular open-source document-oriented NoSQL database. It is developed by
New York City-based MongoDB Inc. and was first released as a product in 2009. It is written
in the C++, JavaScript and C programming languages and uses GridFS as its built-in distributed
file system. MongoDB runs well in many cloud-based environments, including Amazon EC2.
14.5.7.6 Amazon’s SimpleDB
SimpleDB is a fully managed NoSQL data store offered by Amazon. It is a key-value store and
not a full database implementation. SimpleDB was first announced in December 2007
and works with both Amazon EC2 and Amazon S3.
14.5.7.7 Apache’s CouchDB
CouchDB is an open-source document-oriented NoSQL database. It was first developed
in 2005 by a former IBM developer. In 2008, it was adopted as an Apache Incubator
project. The first stable version of CouchDB was released in 2010, and it became popular.
14.5.7.8 Neo4j
Neo4j is an open-source graph database written in Java. It was developed by Neo
Technology of the United States and was initially released in 2007. Its stable versions started
appearing from 2010.
Apart from the few NoSQL databases mentioned above, numerous other
products are available in the market. New developments are happening in this field and many
new products are being launched. Database management giants such as Oracle have also
launched their own NoSQL databases. Hence, many more advancements are expected in
this domain in the coming years.
Non-relational database system is another unique offering in the field of data-intensive
computing. Using this database model, the new age non-relational data-sets can be processed
more efficiently than the traditional database with relational systems. The new age high
performance computing environments like cloud computing systems are heavily dependent
on these non-relational database systems for efficient storage and retrieval of data. This section
focuses on the need of such a database, discusses its storage architecture and also briefs few
such popular databases.
14.5.1 Emergence of Large Volume of Unstructured Data-sets
Earlier data managed by enterprise applications were structured in nature and less in volume.
But with the introduction of web based portals during the end of last century, the nature of
web content or data started changing. Volume of data started to grow exponentially and data
became unstructured in character. Such data-sets were classified later and their characteristics
were identified. This type of data or data-set is referred as ‘Big data’.
14.5.1.1 Big data
Big data is used to describe both structured and unstructured data that is massive in volume.
It also considers data those are too diverse in nature and highly dynamic (very fast-changing).
Differently put, the new age data whose volume, velocity or variety is too great are termed as
Big data. Three said characteristics of Big data are described below.
Volume: A typical PC probably had 10 gigabytes of storage in the year of 2000. During
that time, excessive data volume was a storage issue as storage was not so cheap
like today. Today social networking sites use to generate few thousand terabytes of data
every day. 244
Cloud Computing
Velocity: Data streaming nowadays are happening at unprecedented rate as well as with
speed. So things must be dealt in a timely manner. Quick response to customers’ action is
a business challenge for any organization.
Variety: Data of all formats are important today. Structured or unstructured texts, audio,
video, image, 3D data and others are all being produced every day.
The above characteristics cause variability and complexity in terms of managing big data.
Variability in the sense that data flow can be highly inconsistent with periodic peaks as they are
in social media or in e-commerce portals. The complexity often comes with it when it becomes
difficult to connect and correlate data or define their hierarchies.
It is to be noted that big data is not only about the volume of data; rather it considers other
characteristics of new age data, like their variety or speed of generation.
14.5.2 Time Appeared for an Alternative Database Model
Since the emergence of relational database, enterprise applications started using it since 1980s.
Those relational database systems were developed to store and process structured data-sets.
But the database system started facing challenge as the volume of data started increasing
exponentially from the end of the last century and the situation worsened after the introduction
of web based social networking and e-commerce portals. Soon, the concept of big data emerged.
Online Transaction Processing (OLTP) applications have flooded the web with very high
volumes of data since the beginning of the current century. These applications needed to
function under stiff latency constraints to provide consistent performance to a very large
number of users, as hundreds of millions of clients throughout the world were accessing
them. These sites also experienced massive variations in traffic. Some of the spikes were
due to predictable events like New Year, a business release or a sporting event, but most
were unpredictable and therefore more difficult to manage. Data was being accessed
more frequently and needed to be processed more intensively.
Relational databases are appropriate for a wide range of tasks but not for every task.
The basic operations on any database are read and write. Read operations can be scaled by
distributing and replicating data to multiple servers. But the replicas may become
inconsistent when a write or update operation takes place. And with new age data, the number
of writers is often much larger than the number of readers, especially in popular social
networking sites. One solution to this problem is to partition the data exclusively during
distribution, so that each item lives on exactly one server. But even then, distributed unions
(of data from database tables) may become slower and harder to implement if the underlying
storage architecture does not support them.
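The exclusive partitioning described above can be sketched as follows. This is a minimal illustration, not the mechanism of any particular product: the names `shard_for` and `ShardedStore` are hypothetical, and a SHA-256 hash with a modulo is assumed as the partitioning rule.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to exactly one shard using a stable hash (assumed rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Hypothetical store: each key lives on one shard, so a write touches one server."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def scan_all(self) -> dict:
        """A cross-shard query (a distributed union) must visit every shard."""
        merged = {}
        for shard in self.shards:
            merged.update(shard)
        return merged


store = ShardedStore(num_shards=4)
for user in ("alice", "bob", "carol", "dave"):
    store.put(user, {"posts": 0})
```

A single-key read or write goes to one shard only, while `scan_all` has to contact all of them, which is why unions over partitioned data tend to be slower.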
Here, the main problem was that traditional relational SQL databases
do not scale well. Traditional DBMSs can only ‘scale up’ (vertical scaling), that is, increase
the resources of a central server. But efficient processing of big data requires an excellent
‘scale out’ (horizontal scaling) capability.
Web applications were moving towards the cloud computing model, and it did not take long
for the pioneers of cloud computing services like Google and Amazon, many other e-commerce
and social networking companies, and technologists to realize that traditional relational
databases were no longer enough for handling new age data. They started to look for a
suitable database solution.
Traditional SQL databases do not fit well with the concept of horizontal scaling, yet
horizontal scaling is the only way to scale indefinitely.
14.5.2.1 Modern Age Database Requirements
Horizontal scaling emerged as one of the necessary attributes of a database system to keep pace
with the processing needs of large data-sets. It appeared impossible to deliver high performance
without distributing the data among multiple nodes and processing them in parallel. The other
major concern was the latency associated with transactions. This latency could be reduced by
caching frequently used data in memory on dedicated servers, instead of fetching them every
time they were required. These facilities had to be incorporated in the new age database systems
to reduce the response time and enhance the performance of applications. The databases had to be
highly optimized for simple retrieval and appending operations. These things, along with many
other issues, worked as the driving forces behind the development of an alternative database
system.
14.5.2.2 Role of Cloud Storage System
The characteristics of storage systems had changed during this time. The earlier concern
about the cost of storage space gave way as the cost of storage management gradually became
the dominant element of storage systems. That opened the opportunity to replicate files
across storage in different geographic locations, and hence the use of distributed file systems
became widespread.
In such a scenario, the evolution of storage strategy introduced many different
models of distributed file systems, such as General Parallel File System (GPFS), Google File
System (GFS) and Hadoop Distributed File System (HDFS). All of these work well in high
performance computing environments. The characteristics of such file systems and their storage
strategies suited the cloud’s dynamic architecture well. This created the opportunity to develop
scalable database systems (over these file systems) to store and manage modern age data.
Cloud native databases are facilitated by distributed storage systems, and the two are closely
associated. Hence, the storage system and the database system often overlap.
14.5.3 NoSQL DBMS
NoSQL is a class of database management systems that do not follow all of the rules of a
relational DBMS. The term NoSQL can be interpreted as ‘Not Only SQL’, as it is not a replacement
for RDBMS but a complementary addition to it. Databases of this class may offer SQL-like
query languages, but they do not use traditional SQL (Structured Query Language).
The term NoSQL was coined by Carlo Strozzi in 1998 as the name of the file-based
open-source relational database he was developing, which did not have an SQL interface.
However, this initial usage of the term is not directly linked with NoSQL as the term is
used at present. The term drew attention in 2009 when Eric Evans (an employee of the cloud
hosting company Rackspace) used it at a conference to describe the surge in development of
non-relational distributed databases at that time.
NoSQL is not against SQL; it was developed to handle unstructured big data efficiently
and so provide maximum business value.
14.5.3.1 The Evolution
The NoSQL movement started slowly in the early years of the current century as the IT industry
began to realize the need for a new database system to support web-based applications.
The early advances gained momentum when computing majors Google and Amazon published two
papers in 2006 and 2007 respectively.
14.5.3.2 The BigTable Revolution
In 2004, Google employed a team to develop a storage system to manage Big data. BigTable is
the outcome of that effort. It is a proprietary distributed storage system built by Google on
GFS and has been in use since 2005. The storage system was built to manage large structured
data-sets and was designed to scale to a very large size. It is structured as a large table
which may be petabytes in size and distributed among tens of thousands of machines. BigTable
has successfully provided a flexible, high-performance solution for Google products like
Google Earth, Google Analytics and Orkut.
BigTable had a large impact on NoSQL database design after Google publicly
disclosed its details in a technical paper in 2006. This opened the way for technologists
to develop open-source, BigTable-like databases. Thus, the HBase database developed
by the Apache Foundation and Cassandra developed at Facebook surfaced in the market.
Meanwhile, during all of these developments, Amazon published a paper in 2007 on its Dynamo
storage system, which was also built to address the challenges of working with big data.
BigTable, although built as a storage system, resembles a database system in many ways. It
also shares many implementation strategies with database technologies.
NoSQL database development remained closely associated with the developments
in the field of cloud native file systems (or cloud storage systems) during those days. Soon,
many other web services players started working on the technology, and within a short period
of time, starting around 2008, all of these developments became the source of a technology
revolution. NoSQL databases became prominent after 2009, when the general term
‘NoSQL’ was adopted to set apart these new databases or, more correctly, these file systems.
NoSQL database development has been closely associated with scalable file system
development in computing.
14.5.3.3 CAP Theorem
The abbreviation CAP stands for Consistency, Availability and Partition tolerance of data.
The CAP theorem (also known as Brewer’s theorem) says that it is impossible for a distributed
computer system to guarantee all three aspects of CAP simultaneously. Eric Brewer of the
University of California, Berkeley presented the theorem at an ACM (Association for Computing
Machinery) conference in 2000.
■ Consistency: This means that data in the database remains consistent after execution of an
operation. For example, once a data item is written or updated, all future read requests will
see that data.
■ Availability: It guarantees that the database always remains available without any downtime.
■ Partition tolerance: Here the database should be partitioned in such a way that if one part of
the database becomes unavailable, other parts remain unaffected and can function properly.
This ensures availability of information.
Any database system must follow this ‘two-of-three’ philosophy. Thus, the relational database,
which focuses heavily on consistency, sacrifices the ‘partition tolerance’ attribute of CAP
(Figure 14.1). As already discussed, one of the primary goals of NoSQL systems is to
boost horizontal scalability. To scale horizontally, a system needs strong network partition
tolerance, which means giving up either the ‘consistency’ or the ‘availability’ attribute of
CAP. Thus, every NoSQL database follows either the CP (consistency-partition tolerance) or the
AP (availability-partition tolerance) combination of the CAP attributes. This means some
NoSQL databases even drop consistency as an essential attribute. For example, while
HBase maintains the CP criteria, the other popular database Cassandra maintains the AP criteria.
Some NoSQL databases even choose to relax the ‘consistency’ attribute of the CAP
criteria, and this philosophy suits certain distributed applications well.
Different combinations of CAP criteria serve different kinds of requirements. Database
designers analyze specific data processing requirements before choosing one.
CA: It is suitable for systems designed to run over a cluster at a single site, so that all
of the nodes always remain in contact. Hence, the worry of network partitioning
almost disappears. But if a partition occurs, the system fails.
CP: This model is tolerant of network partitioning, but is suitable for systems where
24 × 7 availability is not a critical issue. Some data may become inaccessible for a while, but
the rest remains consistent or accurate.
AP: This model is also tolerant of network partitioning, as partitions are designed
to work independently. 24 × 7 availability of data is also assured, but sometimes some of the
data returned may be inaccurate.
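The difference between the CP and AP choices during a network partition can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `Cluster` and `Replica` classes are hypothetical, there are just two replicas, and a boolean flag stands in for a broken network link.

```python
class Replica:
    """One copy of the data (hypothetical, in-memory)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Two replicas; mode is 'CP' or 'AP'; `partitioned` simulates a broken link."""

    def __init__(self, mode):
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False

    def write(self, replica, key, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write: consistency is kept, availability is lost.
                raise RuntimeError("unavailable during partition")
            # AP: accept the write locally; the replicas may now disagree.
            replica.data[key] = value
            return
        # No partition: the write reaches both replicas.
        replica.data[key] = value
        other.data[key] = value

    def read(self, replica, key):
        return replica.data.get(key)


cp = Cluster("CP")
cp.write(cp.a, "x", 1)
cp.partitioned = True
try:
    cp.write(cp.a, "x", 2)
    cp_rejected = False
except RuntimeError:
    cp_rejected = True

ap = Cluster("AP")
ap.write(ap.a, "x", 1)
ap.partitioned = True
ap.write(ap.a, "x", 2)  # accepted, but replica b still holds the old value
```

In the CP cluster the write is rejected and both replicas still agree; in the AP cluster the write succeeds but a read from replica `b` returns stale data.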
Partition tolerance is an essential criterion for NoSQL databases, as one of their primary goals
is horizontal scalability.
14.5.3.4 BASE Theorem
Relational database systems treat consistency and availability as essential criteria.
Fulfillment of these criteria is ensured by following the ACID (Atomicity, Consistency,
Isolation and Durability) properties in RDBMS. NoSQL databases tackle the consistency issue
in a different way. They are not so stringent about consistency; rather, they focus on partition
tolerance and availability. Hence, NoSQL databases no longer need to follow the ACID rules.
A NoSQL database should be much easier to scale out (horizontal scaling) and capable of
handling large volumes of unstructured data. To achieve this, NoSQL databases usually follow
the BASE principle, which stands for ‘Basically Available, Soft state, Eventual consistency’.
The BASE principle was also defined by Eric Brewer, who is known for formulating the CAP theorem.
The three criteria of BASE are explained below:
■ Basically Available: This principle states that data should remain available even in the
presence of multiple node failures. This is achieved by using a highly distributed approach
with multiple replicas in the database management.
■ Eventual Consistency: This principle states that immediately after an operation, data may
look inconsistent, but ultimately it should converge to a consistent state. For
example, two users querying for the same data immediately after a transaction (on that data)
may get different values. But finally, consistency will be regained.
■ Soft State: The eventual consistency model allows the database to be inconsistent for some
time. But to bring it back to a consistent state, the system should allow its state to change
over time even without any input. This is known as the soft state of the system.
BASE does not enforce strict consistency. The AP region of Figure 14.1 follows the BASE
theory. The idea behind this is that data consistency is the application developer’s problem and
should be handled by the developer through appropriate programming techniques. The database
no longer handles the consistency issue. This philosophy helps to achieve the scalability goal.
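Eventual consistency and soft state can be sketched with two replicas that are synchronized by a background pass. The code below is an illustrative assumption, not how any specific database works: the class `VersionedReplica`, the function `anti_entropy` and the ‘highest version wins’ rule are all hypothetical choices made for the example.

```python
class VersionedReplica:
    """Hypothetical replica that stores (version, value) per key."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        current = self.data.get(key)
        # Keep only the newest version of each key.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(replicas):
    """Background sync: every replica receives the newest version of each key."""
    newest = {}
    for replica in replicas:
        for key, (version, value) in replica.data.items():
            if key not in newest or version > newest[key][0]:
                newest[key] = (version, value)
    for replica in replicas:
        for key, (version, value) in newest.items():
            replica.write(key, value, version)


r1, r2 = VersionedReplica(), VersionedReplica()
r1.write("balance", 100, version=1)
r2.write("balance", 100, version=1)
r1.write("balance", 80, version=2)  # the update has reached r1 only

before = (r1.read("balance"), r2.read("balance"))
anti_entropy([r1, r2])              # state changes without any new client input
after = (r1.read("balance"), r2.read("balance"))
```

Two users reading immediately after the update see different values (`before`), and after the background pass both replicas agree (`after`).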
To satisfy the scalability and data distribution demands of NoSQL, it was no longer possible to
meet all four criteria of ACID simultaneously. Hence, the BASE principle was proposed as an
alternative.
14.5.4 Features of NoSQL Database
NoSQL databases introduce many new features in comparison with relational databases. A few
of those features run counter to the relational DBMS concept. They can be listed as schema-free,
non-relational, horizontally scalable and distributed.
14.5.4.1 Flexible Schemas
Relational database systems cannot handle data whose structure is not known in advance. They
need the schema of the database and its tables to be defined before any data is stored. But with
this schema-based design, it becomes difficult to manage agile data-sets. When a business
needs to introduce a new field (column) in some table midway, the change becomes extremely
disruptive as it requires alteration of the schema. This is a very slow process and can involve
significant downtime.
NoSQL databases are designed to allow insertion of data without a pre-defined schema. This
makes it very easy to incorporate real-time changes in an application when required, as
that does not cause service interruption.
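As a small illustration of schema-free insertion, the sketch below keeps records as plain Python dictionaries in a list. The collection, the field names and the helper `field_values` are hypothetical and stand in for a real NoSQL store.

```python
# A hypothetical schema-free collection: each record is just a dictionary.
collection = []

collection.append({"id": 1, "name": "Asha", "email": "asha@example.com"})

# A later record introduces a new field; no schema alteration is needed.
collection.append(
    {"id": 2, "name": "Ravi", "email": "ravi@example.com", "twitter": "@ravi"}
)


def field_values(records, field):
    """Read one field from every record; older records simply lack it."""
    return [record.get(field) for record in records]


twitter_handles = field_values(collection, "twitter")
```

The application has to cope with records that lack the new field (here they yield `None`), which is the price paid for avoiding a schema change.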
Unlike a relational database, a NoSQL database is schema-free.
14.5.4.2 Non-relational
A NoSQL database can manage non-relational data efficiently along with relational data. The
relational constraints of RDBMS do not apply in this kind of database. This makes it easier to
manage non-relational data using a NoSQL database.
14.5.4.3 Scalability
Relational databases are designed to scale vertically. But vertical scaling has its own
limitations, as it does not allow new servers to be introduced into the system to share the load.
Horizontal scaling is the only way to scale indefinitely, and it is also cheaper than vertical
scaling. NoSQL databases are designed to scale horizontally with minimum effort.
14.5.4.4 Auto-distribution
Distributed relational databases allow fragmentation and distribution of a database across
multiple servers. But that does not happen automatically; it is a manual process that has to be
handled by the application, which makes it difficult to manage. In NoSQL databases, on the other
hand, the distribution happens automatically. Application developers need not worry about it.
All of the distribution and load balancing is automated in the database itself.
Distribution and replication of data segments are not inherent features of relational databases;
they are the responsibilities of application developers. In NoSQL, they happen automatically.
14.5.4.5 Auto-replication
Like fragmentation and distribution, replication of database fragments is also an
automatic process in NoSQL. No external programming is required to replicate fragments
across multiple servers. Replication ensures high availability of data and supports recovery.
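How a database might place copies of each fragment on several servers can be sketched as below. This is an assumption-laden illustration: the placement rule (hash the fragment id, then take the next nodes in order), the node names and the function names are hypothetical, and the replication factor must not exceed the number of nodes.

```python
import hashlib


def replica_nodes(fragment_id, nodes, replication_factor):
    """Pick `replication_factor` distinct nodes, starting at a hashed position."""
    digest = hashlib.sha256(fragment_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def readable_after_failure(placement, failed_node):
    """True if every fragment still has at least one copy on a healthy node."""
    return all(
        any(node != failed_node for node in holders)
        for holders in placement.values()
    )


nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = {
    fragment: replica_nodes(fragment, nodes, replication_factor=3)
    for fragment in ("frag-1", "frag-2", "frag-3")
}
```

With three copies per fragment, the loss of any single node leaves every fragment readable, which is the availability benefit the paragraph above refers to.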
14.5.4.6 Integrated Caching
NoSQL databases often provide an integrated caching capability. This feature reduces latency
and increases throughput by keeping frequently used data in system memory as much as
possible. With a relational database, a separate caching layer needs to be maintained to achieve
this performance goal.
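The idea of keeping frequently used data in memory can be sketched as a small read-through cache. The class name `ReadThroughCache`, the least-recently-used eviction policy and the dictionary standing in for slow storage are assumptions made for this example only.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Hypothetical in-memory cache in front of a slower backing store."""

    def __init__(self, backing_store, capacity):
        self.backing_store = backing_store
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.backing_store[key]      # slow path: fetch from storage
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used key
        return value


disk = {"a": 1, "b": 2, "c": 3}              # stands in for slow storage
cache = ReadThroughCache(disk, capacity=2)
for key in ("a", "a", "b", "c", "a"):
    cache.get(key)
```

Repeated reads of a hot key are served from memory, while a capacity of two forces older keys out as new ones arrive.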
It should be added that although NoSQL databases offer many advantages
over relational databases, they do not provide the rich reporting and analytical functionality
of RDBMS in some specific scenarios.
Despite its many benefits, NoSQL does not provide the rich analytical functionality that
RDBMS offers in some specific cases.
14.5.5 NoSQL Database Types
There are four different types of NoSQL databases. Each of them is designed to address the needs
of a particular class of problems. Various NoSQL database service providers try to offer solutions
for different types of problems. The following sections describe the four NoSQL database types.
14.5.5.1 Key-Value Database
The Key-Value Database (or KV Store) is the simplest among the various NoSQL databases.
It pairs up data with a key and maintains the database like a hash table, where data values are
referred to by their keys. The main benefit of such pairing is that it makes the database easily
scalable. However, it is not suitable where queries are based on the value rather than on the
key. Amazon’s DynamoDB, Azure Table Storage and CouchDB are a few popular examples of this
type of NoSQL database.
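A key-value store can be pictured as a hash table with a thin interface. The sketch below uses a hypothetical `KeyValueStore` class backed by a Python dictionary to show why key lookups are direct while value-based queries need a full scan.

```python
class KeyValueStore:
    """Hypothetical in-memory key-value store backed by a hash table."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        self._table[key] = value

    def get(self, key, default=None):
        # Direct hash lookup by key.
        return self._table.get(key, default)

    def find_by_value(self, predicate):
        # A value-based query has no index to use: every entry is inspected.
        return [key for key, value in self._table.items() if predicate(value)]


cart = KeyValueStore()
cart.put("session:42", ["book", "pen"])
cart.put("session:43", ["lamp"])

items = cart.get("session:42")
sessions_with_lamp = cart.find_by_value(lambda value: "lamp" in value)
```

`get` touches one entry, whereas `find_by_value` walks the whole table, which is why this model suits key-based access better than value-based queries.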
14.5.5.2 Document-Oriented Database
A Document-Oriented Database (or Document Store) is a database in which data is stored in
documents. It is similar to a Key-Value store, with the values stored as structured documents.
The documents are addressed by, and can be retrieved from the database using, a key. This key
can be a path, a URI or a simple string.
The documents are schema-free and can be of any format as long as the database application
can understand their internal structure. Generally, document-oriented databases use one of the
XML, JSON (JavaScript Object Notation) or Binary JSON (BSON) formats.
One document can be referred to by multiple keys, and a document can refer to other
documents by storing their keys. But each document is treated as stand-alone and there is no
constraint to enforce relational integrity. MongoDB, Apache CouchDB and Couchbase are a few
popular examples of document-oriented databases.
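The way documents refer to each other by key, without any enforced relational integrity, can be sketched as follows. The key layout (`authors/1`, `books/10`), the field `author_key` and the helpers `save` and `load` are hypothetical; JSON text in a dictionary stands in for the document store.

```python
import json

# Hypothetical document store: key -> JSON text.
documents = {}


def save(key, doc):
    documents[key] = json.dumps(doc)


def load(key):
    raw = documents.get(key)
    return None if raw is None else json.loads(raw)


save("authors/1", {"name": "Example Author"})
save("books/10", {"title": "Example Book", "author_key": "authors/1"})
# The store accepts a reference to a document that does not exist.
save("books/11", {"title": "Orphan Book", "author_key": "authors/999"})

book = load("books/10")
author = load(book["author_key"])
missing_author = load(load("books/11")["author_key"])
```

Following a stored key works like a manual join, and a dangling key simply yields nothing, so the application has to handle that case itself.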
14.5.5.3 Column-Family Database
A Column-Family Database (or Wide-Column Data Store/Column Store) stores data grouped
in columns. Each column consists of three elements: a name, a value and a time-stamp. The name
is used to refer to the column and the time-stamp is used to identify the required version of
the content. For example, the time-stamp is useful in finding up-to-date content. Columns of a
similar type that are often accessed together form a column family. A column family can
contain a virtually unlimited number of columns.
In relational databases, each row is stored as a contiguous disk entry. Different rows may
get stored in different places on disk. In contrast, in a column-family database, all of the
cells corresponding to a column are stored as a contiguous disk entry. This makes access
to data along a column faster. For example, searching for a particular title among the records
of a million books stored in the relational data model is an intensive task, as it may cause
millions of disk accesses. On the other hand, using the column-family data model, the titles
are stored together and can be read with a single contiguous disk access.
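The contrast between the two layouts can be sketched by counting how many stored values a title search has to touch. The tiny book table and the function names are hypothetical, and counting values read is only a rough stand-in for disk accesses.

```python
books = [
    {"id": 1, "title": "Alpha", "author": "A", "year": 2001},
    {"id": 2, "title": "Beta", "author": "B", "year": 2002},
    {"id": 3, "title": "Gamma", "author": "C", "year": 2003},
]

# Row-oriented layout: each record is stored together.
row_store = [(b["id"], b["title"], b["author"], b["year"]) for b in books]

# Column-oriented layout: all values of one column are stored together.
column_store = {
    name: [b[name] for b in books] for name in ("id", "title", "author", "year")
}


def find_title_rows(rows, title):
    """Scan whole rows; every field of each visited row is read."""
    touched = 0
    for row in rows:
        touched += len(row)
        if row[1] == title:
            return row[0], touched
    return None, touched


def find_title_columns(columns, title):
    """Scan only the title column, then fetch the single matching id."""
    touched = 0
    for position, value in enumerate(columns["title"]):
        touched += 1
        if value == title:
            return columns["id"][position], touched + 1
    return None, touched


row_result = find_title_rows(row_store, "Gamma")
col_result = find_title_columns(column_store, "Gamma")
```

Both searches find the same book, but the column layout reads far fewer values because the unrelated author and year fields are never touched.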
The difference between column stores and key-value stores is that column stores are
optimized to handle data along columns. Column stores show better analytical power and
provide improved performance by imposing a certain amount of rigidity on the database schema.
In some ways, column stores are an intermediate solution between traditional RDBMSs and
key-value stores. Hadoop’s HBase is a popular example of a column store-based database.
14.5.5.4 Graph Database
In a Graph Database (or Graph Store), data is stored as graph structures using nodes and
edges. The entities are represented as nodes and the relationships between entities as edges.
A graph database follows index-free adjacency, where every node directly points to its adjacent
nodes. In this setup, the cost of a hop (traversing one edge) remains the same as the number of
nodes increases. This is useful for storing information about relationships when the number of
elements is huge, as with social connections. Twitter uses such a database to track who is
following whom. Examples of popular graph-based databases include Neo4j, InfoGrid,
InfiniteGraph and a few others.
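Index-free adjacency can be sketched with nodes that hold direct references to their neighbours. The `Node` class, the `follows` list and the sample users are hypothetical and only illustrate the ‘who follows whom’ case mentioned above.

```python
class Node:
    """Hypothetical graph node holding direct references to adjacent nodes."""

    def __init__(self, name):
        self.name = name
        self.follows = []  # edges stored as object references, not index lookups

    def follow(self, other):
        self.follows.append(other)


alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
alice.follow(bob)
bob.follow(carol)


def followees_of_followees(node):
    """Two hops: each hop just follows the references stored on the node."""
    return sorted({second.name for first in node.follows for second in first.follows})


direct = [n.name for n in alice.follows]
two_hops = followees_of_followees(alice)
```

Each hop only inspects the references stored on the current node, so its cost does not depend on how many other nodes the graph contains.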
14.5.6 Selecting the Suitable NoSQL Database Solution
Each of the NoSQL database types has its own strengths and weaknesses. They are designed to
serve different kinds of data storage requirements and hence are not directly comparable.
The ‘one-size-fits-all’ philosophy of relational databases is not applicable in the NoSQL
database domain. Here, users have the flexibility of choosing among multiple options after
analyzing the requirements of their applications.
Selecting the NoSQL database strategy is not a one-time decision. First, one has
to identify the requirements of an application which are not met by relational database
systems. Then a suitable NoSQL database solution has to be identified to meet those unfulfilled
requirements. More than one NoSQL database type may even be used to meet all of these
necessities.
Sometimes a single application performs best when more than one
NoSQL database type is employed together. In such a case, a multi-model NoSQL database can be
used, which is designed to support several of the four primary NoSQL data models.
The days when one DBMS was used to fit all needs are over. Now, a single application may use
several different data stores at the back-end.
14.5.7 Commercial NoSQL Databases
Commercial NoSQL databases started surfacing two years after the publication of Google’s
paper on BigTable in 2006 and Amazon’s paper on Dynamo in 2007. After these publications,
many initiatives were taken up for both open-source and closed-source development of NoSQL
databases. By the end of 2009, there were several releases, including the BigTable-inspired
HBase and the Dynamo-inspired Riak and Cassandra. The following sections briefly describe some
of the popular NoSQL databases.
14.5.7.1 Apache’s HBase
HBase is an open-source NoSQL database system written in Java. It was developed by the Apache
Software Foundation as part of its Hadoop project. HBase’s design architecture was
inspired by Google’s internal storage system BigTable. Just as Google’s BigTable uses GFS,
Hadoop’s HBase uses HDFS as its underlying file system. HBase is a column-oriented database
management system.
14.5.7.2 Amazon’s DynamoDB
DynamoDB is a key-value NoSQL database developed by Amazon and launched in 2012. It derives
its name from Dynamo, Amazon’s internal storage system. The database
service is fully managed by Amazon and offered as part of the Amazon Web Services portfolio.
DynamoDB is specifically useful for supporting a large volume of concurrent updates and is well
suited to shopping-cart-like operations.
14.5.7.3 Apache’s Cassandra
Cassandra is an open-source NoSQL database management system developed in Java. It was
initially developed at Facebook and was then released as an open-source project in 2008 with
the goal of further advancement. Although Facebook’s kingdom was largely dependent on
Cassandra, the company still released it as an open-source project, possibly assured that it
was too late for others to use the technology to knock its castle down. Cassandra became an
Apache Incubator project in 2009. Cassandra is a hybrid of column-oriented and key-value data
stores, and is suitable for deployment both across many commodity servers and on cloud
infrastructure.
14.5.7.4 Google Cloud Datastore
Cloud Datastore is developed by Google and is available as a fully managed NoSQL database
service. Cloud Datastore is very easy to use and supports an SQL-like query language called GQL.
The Datastore is a NoSQL key-value database where users can store data as key-value pairs.
Cloud Datastore also supports ACID transactions using optimistic concurrency control.
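Optimistic concurrency control in general can be sketched as a version check at write time. The code below is a generic illustration and not the Cloud Datastore API: the classes `VersionedStore` and `ConflictError`, and the `expected_version` parameter, are hypothetical.

```python
class ConflictError(Exception):
    """Raised when the data changed after the caller read it."""


class VersionedStore:
    """Hypothetical store keeping (version, value) per key."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        # Unknown keys are reported as version 0 with no value.
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            # Someone else committed first; the caller must re-read and retry.
            raise ConflictError(f"{key}: expected v{expected_version}, found v{current_version}")
        self._data[key] = (current_version + 1, value)


inventory = VersionedStore()
inventory.write("stock", 10, expected_version=0)

# Two clients read the same version without taking any lock.
version_a, stock_a = inventory.read("stock")
version_b, stock_b = inventory.read("stock")

inventory.write("stock", stock_a - 1, expected_version=version_a)  # client A commits
try:
    inventory.write("stock", stock_b - 1, expected_version=version_b)
    conflict = False
except ConflictError:
    conflict = True

# Client B retries with fresh data.
version_b, stock_b = inventory.read("stock")
inventory.write("stock", stock_b - 1, expected_version=version_b)
```

No locks are held while clients work; a conflict is detected only at commit time, and the losing client re-reads and retries so that neither update is lost.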
14.5.7.5 MongoDB
MongoDB is a popular document-oriented open-source NoSQL database. It is developed by
New York City-based MongoDB Inc. and was first released as a product in 2009. It is written
in the C++, JavaScript and C programming languages and uses GridFS as its built-in distributed
file system. MongoDB runs well on many cloud-based environments, including Amazon EC2.
14.5.7.6 Amazon’s SimpleDB
SimpleDB is a fully managed NoSQL data store offered by Amazon. It is a key-value store and
not actually a full database implementation. SimpleDB was first announced in December 2007
and works with both Amazon EC2 and Amazon S3.
14.5.7.7 Apache’s CouchDB
CouchDB is an open-source document-oriented NoSQL database. CouchDB was first developed
in 2005 by a former IBM developer. Later, in 2008, it was adopted as an Apache Incubator
project. In 2010, the first stable version of CouchDB was released and it became popular.
14.5.7.8 Neo4j
Neo4j is an open-source graph database developed in Java. It was developed by Neo
Technology of the United States and was initially released in 2007, but its stable versions
started appearing from 2010.
Apart from the few NoSQL databases mentioned above, numerous other
products are available in the market. New developments are happening in this field and many
new products are being launched. Database management giants like Oracle have also
launched their own NoSQL databases. Hence, many more advancements are expected in
this domain in the coming years.