NON-RELATIONAL DBMS IN CLOUD

The non-relational database system is another distinctive offering in the field of data-intensive

computing. With this database model, new-age non-relational data-sets can be processed

more efficiently than with traditional relational database systems. New-age high-performance

computing environments such as cloud computing systems depend heavily

on these non-relational database systems for efficient storage and retrieval of data. This section

explains the need for such databases, discusses their storage architecture and briefly introduces a few

popular ones.

14.5.1 Emergence of Large Volume of Unstructured Data-sets

Earlier, the data managed by enterprise applications was structured in nature and small in volume.

With the introduction of web-based portals towards the end of the last century, the nature of

web content began to change. The volume of data started to grow exponentially and the data

became unstructured in character. Such data-sets were later classified and their characteristics

were identified. This type of data is referred to as ‘Big data’.

14.5.1.1 Big data

Big data is used to describe both structured and unstructured data that is massive in volume.

It also covers data that is too diverse in nature or highly dynamic (very fast-changing).

Put differently, new-age data whose volume, velocity or variety is too great is termed

Big data. These three characteristics of Big data are described below.

Volume: A typical PC probably had about 10 gigabytes of storage in the year 2000. At

that time, excessive data volume was a storage problem because storage was not as cheap

as it is today. Today, social networking sites generate a few thousand terabytes of data

every day.

Velocity: Data now streams in at an unprecedented rate and

speed, so it must be dealt with in a timely manner. Responding quickly to customers’ actions is

a business challenge for any organization.

Variety: Data of all formats are important today. Structured or unstructured texts, audio,

video, image, 3D data and others are all being produced every day.

The above characteristics cause variability and complexity in terms of managing big data.

Variability means that the data flow can be highly inconsistent, with periodic peaks such as those seen

in social media or on e-commerce portals. Complexity arises when it becomes

difficult to connect and correlate data or to define their hierarchies.

It is to be noted that big data is not only about the volume of data; rather it considers other

characteristics of new age data, like their variety or speed of generation.

14.5.2 Time Appeared for an Alternative Database Model

Enterprise applications have been using relational databases since the 1980s.

These relational database systems were developed to store and process structured data-sets.

But they started facing challenges as the volume of data began to increase

exponentially towards the end of the last century, and the situation worsened after the introduction

of web-based social networking and e-commerce portals. Soon, the concept of big data emerged.

Online Transaction Processing (OLTP) applications flooded the web with very high volume

of data from the beginning of the current century. These applications needed to function under

stiff latency constraints to provide consistent performance to a very large number of users as

hundreds of millions of clients throughout the world were accessing such applications. These

sites also experienced massive variations in traffic. Some of the spikes were due to

predictable events such as New Year, a business release or a sporting event, but most others were

caused by unpredictable events, which are more difficult to manage. Data was being accessed

more frequently and needed to be processed more intensively.

Relational databases are appropriate for a wide range of tasks but not for every task.

The basic operations at any database are read and write. Read operations can be scaled by

distributing and replicating data to multiple servers. But data may become inconsistent

when a write or update operation takes place. And with new-age data, the number of writers

is often much larger than the number of readers, especially on popular social networking sites.

One solution to this problem is to partition the data exclusively during distribution. But even

then, distributed unions (of data from database tables) may become slower and harder

to implement if the underlying storage architecture does not support them.

The main problem was that traditional SQL databases built on relational systems

do not scale well. Traditional DBMSs are designed to ‘scale up’ (vertical scaling), that is, to increase the

resources on a central server. Efficient processing of big data, however, requires an excellent ‘scale out’

(horizontal scaling) capability.

Web applications were moving towards the cloud computing model, and it did not take long

for the pioneers of cloud computing services such as Google and Amazon, many other e-commerce

and social networking companies, and technologists to realize that traditional relational

databases were no longer sufficient for handling new-age data. They started to look for a suitable

database solution.

Traditional SQL databases do not fit well with the concept of horizontal scaling, yet horizontal

scaling is the only way to scale indefinitely.

14.5.2.1 Modern Age Database Requirements

Horizontal scaling emerged as a necessary attribute of a database system that must keep pace

with the processing needs of large data-sets. It appeared impossible to deliver high performance

without distributing the data among multiple nodes and processing it in parallel. The other

major concern was the latency associated with transactions. This latency could be reduced by

caching frequently used data in memory on dedicated servers, instead of fetching it from storage every

time it was required. These facilities had to be incorporated in the new age database systems to reduce

the response time and enhance the performance of applications. The databases had to be highly

optimized for simple retrieval and appending operations. These things, along with many other

issues, worked as the driving forces behind the development of an alternative database system.
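
The idea of distributing data among multiple nodes can be sketched in a few lines. The example below is only an illustration under stated assumptions: the three-node cluster, the `shard_for`, `put` and `get` helpers and the `user:N` keys are all hypothetical, and each node is modelled as a plain in-memory dictionary rather than a real server.

```python
import hashlib

def shard_for(key: str, num_nodes: int) -> int:
    """Map a key to a node index with a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used here to keep placement stable.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

# Hypothetical three-node cluster: each node is just a dict.
nodes = [dict() for _ in range(3)]

def put(key: str, value) -> None:
    """Store the value on the single node that owns the key."""
    nodes[shard_for(key, len(nodes))][key] = value

def get(key: str):
    """Read from the owning node only; other nodes are not contacted."""
    return nodes[shard_for(key, len(nodes))].get(key)

for user_id in ("user:1", "user:2", "user:3", "user:4"):
    put(user_id, {"id": user_id})
```

Because every key has exactly one owner, reads and writes for different keys can proceed on different nodes in parallel. Simple modulo placement like this moves most keys when the number of nodes changes, which is one reason real systems tend to use more elaborate placement schemes.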

14.5.2.2 Role of Cloud Storage System

The characteristics of storage systems had changed during this time. The earlier concern

about the cost of storage space receded, and the cost of storage management gradually became the

dominant element of storage systems. That opened the opportunity to replicate files into

storage across different geographic locations, and hence the use of distributed file systems

became widespread.

In such a scenario, the evolution of storage strategy introduced many different

models of distributed file systems, such as General Parallel File System (GPFS), Google File System

(GFS), Hadoop Distributed File System (HDFS) and others. All of these work well in

high-performance computing environments. The characteristics of such file systems and their storage

strategy suited the cloud’s dynamic architecture well. This created the opportunity to develop

scalable database systems (over these file systems) to store and manage modern-age data.

Cloud native databases are facilitated by distributed storage systems and the two are closely

associated with one another. Hence, the storage and database systems often overlap.

14.5.3 NoSQL DBMS

NoSQL is a class of database management system that does not follow all of the rules of a

relational DBMS. The term NoSQL can be interpreted as ‘Not Only SQL’, as it is not a replacement

for RDBMS but rather a complementary addition to it. Databases of this class often use SQL-like

query languages to make queries but do not use traditional SQL (structured query

language).

The term NoSQL was coined by Carlo Strozzi in 1998 to name the file-based

open-source relational database he was developing, which did not have an SQL interface.

However, this initial usage of the term is not directly linked with NoSQL as it is

used at present. The term drew attention in 2009 when Eric Evans (an employee of the cloud

hosting company Rackspace) used it at a conference to describe the surge in development of

non-relational distributed databases at that time.

NoSQL is not against SQL and it was developed to handle unstructured big data in an efficient

way to provide maximum business value.

14.5.3.1 The Evolution

The NoSQL movement started slowly in the early years of the current century as the IT industry

began to realize the need for a new database system to support web-based applications.

The initial advances gained ground when computing majors Google and Amazon published two

papers, in 2006 and 2007 respectively.

14.5.3.2 The BigTable Revolution

In 2004, Google employed a team to develop a storage system to manage Big data. BigTable is

the outcome of that effort. It is a proprietary distributed storage system built by Google on GFS and has

been in use since 2005. The storage system was built to manage large structured data-sets and

was designed to scale to a very large size. It is structured as a large table which may be petabytes

in size and distributed among tens of thousands of machines. BigTable has successfully

provided a flexible, high-performance solution for Google products like Google Earth, Google

Analytics and Orkut.

BigTable went on to have a large impact on NoSQL database design after Google publicly

disclosed its details in a technical paper in 2006. This opened the way for technologists

to pursue open-source development of BigTable-like databases. Thus, the HBase database developed

under the Apache Software Foundation and Cassandra developed at Facebook appeared in the market.

Meanwhile, during all of these developments, Amazon also published a paper on their Dynamo

storage system in 2007 which was also built to address the challenges of working with big data.

BigTable, although built as a storage system, resembles a database system in many ways. It also shares

many implementation strategies with database technologies.

The NoSQL database development process remained closely associated with the developments

in the field of cloud native file systems (or, cloud storage systems) during those days. Soon, many

other players of web services started working on the technology and in a short period of time,

starting around 2008, all of these developments became the source of a technology

revolution. NoSQL databases became prominent after 2009, when the general term

‘NoSQL’ was adopted to set apart these new databases or, more correctly, the file systems.

NoSQL database development has been closely associated with scalable file system

development in computing.

14.5.3.3 CAP Theorem

The abbreviation CAP stands for Consistency, Availability and Partition tolerance of data.

The CAP theorem (also known as Brewer’s theorem) says that it is impossible for a distributed

computer system to guarantee all three aspects of CAP simultaneously. Eric Brewer of the University

of California, Berkeley presented the theorem at an ACM (Association for Computing

Machinery) conference in 2000.

■ Consistency: This means that data in the database remains consistent after execution of an

operation. For example, once a data item is written or updated, all future read requests will

see that data.

■ Availability: This means that every request receives a response, that is, the database remains available without downtime.

■ Partition tolerance: This means that the system keeps functioning even when a network failure splits its

nodes into groups that cannot communicate with each other. If one part of the database

becomes unreachable, the other parts remain unaffected and can function properly.

Any distributed database system must follow this ‘two-of-three’ philosophy. Thus, the relational database,

which focuses heavily on consistency, sacrifices the ‘partition tolerance’ attribute of CAP

(Figure 14.1). As already discussed, one of the primary goals of NoSQL systems is to

boost horizontal scalability. To scale horizontally, a system needs strong network partition

tolerance, which requires giving up either the ‘consistency’ or the ‘availability’ attribute of CAP. Thus, all

NoSQL databases follow either the CP (consistency-partition tolerance) or the

AP (availability-partition tolerance) combination of the CAP attributes. This means some

NoSQL databases even drop consistency as an essential attribute. For example, while

HBase follows the CP criteria, the other popular database Cassandra follows the AP criteria.

Some of the NoSQL databases even choose to relax the ‘consistency’ issue from the CAP

criteria and this philosophy suits well in certain distributed applications.

Different combinations of CAP criteria serve different kinds of requirements. Database

designers analyze specific data processing requirements before choosing one.

CA: It is suitable for systems designed to run over a cluster on a single site so that all

of the nodes always remain in contact. Hence, the worry about network partitioning

almost disappears. But if a partition occurs, the system fails.

CP: This model tolerates network partitioning, but it is suitable only for systems where

24 × 7 availability is not critical. Some data may become inaccessible for a while but

the rest remains consistent and accurate.

AP: This model also tolerates network partitioning, as partitions are designed

to work independently. 24 × 7 availability of data is assured, but sometimes some of the

data returned may be stale or inaccurate.

Partition tolerance is an essential criterion for NoSQL databases, as one of their primary goals is

horizontal scalability.
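
The difference between the CP and AP choices can be illustrated with a toy model. The sketch below is not how any particular database is implemented: the `TwoNodeStore` class, its `mode` flag and the `partitioned` switch are hypothetical names introduced only to show what each choice gives up when two replicas cannot communicate.

```python
class Replica:
    """One copy of the data, held in memory."""
    def __init__(self):
        self.data = {}

class TwoNodeStore:
    """Toy two-replica store; `mode` is 'CP' or 'AP' (illustrative only)."""
    def __init__(self, mode: str):
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False  # True when the replicas cannot talk

    def write(self, node: int, key, value) -> bool:
        if self.partitioned:
            if self.mode == "CP":
                return False      # refuse the write: availability is given up
            self.replicas[node].data[key] = value
            return True           # accept locally: consistency is given up
        for replica in self.replicas:
            replica.data[key] = value
        return True

    def read(self, node: int, key):
        return self.replicas[node].data.get(key)

cp, ap = TwoNodeStore("CP"), TwoNodeStore("AP")
for store in (cp, ap):
    store.write(0, "x", 1)        # replicated to both nodes
    store.partitioned = True      # the network link fails

cp_ok = cp.write(0, "x", 2)       # rejected, both replicas still agree
ap_ok = ap.write(0, "x", 2)       # accepted on node 0 only, replicas diverge
```

In this simplified model the CP store rejects every write during a partition; real CP systems usually keep serving the side that can still reach a majority of replicas.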

14.5.3.4 BASE Theorem

Relational database systems treat consistency and availability as essential criteria.

Fulfillment of these criteria is ensured by following the ACID (Atomicity, Consistency,

Isolation and Durability) properties in RDBMS. NoSQL databases tackle the consistency issue

in a different way. They are not so stringent on consistency; rather, they focus on partition

tolerance and availability. Hence, NoSQL databases no longer need to follow the ACID rules.

A NoSQL database should be much easier to scale out (horizontal scaling) and capable of

handling large volumes of unstructured data. To achieve this, NoSQL databases usually follow the

BASE principle, which stands for ‘Basically Available, Soft state, Eventual consistency’. The

BASE theorem was also defined by Eric Brewer, who is known for formulating the CAP theorem.

The three criteria of BASE are explained below:

■ Basically Available: This principle states that data should remain available even in the

presence of multiple node failures. This is achieved by using a highly-distributed approach

with multiple replications in the database management.

■ Eventual Consistency: This principle states that immediately after an operation, data may appear

inconsistent, but it should ultimately converge to a consistent state. For

example, two users querying for the same data immediately after a transaction (on that data)

may get different values. But eventually, consistency will be regained.

■ Soft State: The eventual consistency model allows the database to be inconsistent for some

time. But to bring it back to consistent state, the system should allow change in state over

time even without any input. This is known as Soft state of system.

BASE does not guarantee immediate consistency. The AP region of Figure 14.1 follows the BASE

theory. The idea behind this is that data consistency is the application developer’s problem and

should be handled by the developer through appropriate programming techniques. The database

no longer enforces strict consistency itself. This philosophy helps to achieve the scalability goal.

To satisfy the scalability and data distribution demands in NoSQL, it was no longer possible to

meet all the four criteria of ACID simultaneously. Hence, BASE theorem was proposed as an

alternative.
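
Eventual consistency and soft state can be made concrete with a small simulation. This is a minimal sketch under assumptions: it uses a 'newest timestamp wins' rule and a hypothetical `sync` function that stands in for the background exchange between replicas; real databases use a variety of conflict-resolution strategies.

```python
class Replica:
    """Stores key -> (timestamp, value) and keeps only the newest version."""
    def __init__(self):
        self.store = {}

    def write(self, key, value, timestamp):
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]

def sync(a, b):
    """Background exchange: each replica offers its entries to the other,
    and the entry with the newest timestamp wins on both sides."""
    for src, dst in ((a, b), (b, a)):
        for key, (ts, value) in list(src.store.items()):
            dst.write(key, value, ts)

r1, r2 = Replica(), Replica()
r1.write("cart", ["book"], timestamp=1)
sync(r1, r2)

r1.write("cart", ["book", "pen"], timestamp=2)  # only r1 has seen this so far
stale = r2.read("cart")                         # a reader on r2 sees old data
sync(r1, r2)                                    # state changes without new input
```

Between the second write and the second `sync` call, two users reading from different replicas get different answers; after the exchange both replicas hold the same value.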

14.5.4 Features of NoSQL Database

NoSQL databases introduce many new features in comparison with relational databases. A few

of those features run counter to relational-DBMS concepts. They can be listed as schema-free,

non-relational, horizontally scalable and distributed.

14.5.4.1 Flexible Schemas

Relational database systems cannot handle data whose structure is not known in advance. The

schema of the database and its tables must be defined before any data is stored. With this

schema-based design, it becomes difficult to manage agile data-sets. When, in the middle of

business operations, a new field (column) has to be introduced in some table, the change can be

disruptive because it requires alteration of the schema. This can be a slow process and may involve

significant downtime.

NoSQL databases are designed to allow insertion of data without a pre-defined schema. This

makes it easy to incorporate changes in an application when they are required, as

doing so does not cause service interruption.

Unlike relational database, NoSQL database is schema-free.
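
A schema-free collection can be pictured as a list of records that need not share the same fields. The snippet below is a hypothetical illustration using plain Python dictionaries; the `products` collection and its field names are invented for the example and do not correspond to any particular product.

```python
products = []  # hypothetical schema-free collection

products.append({"sku": "A1", "name": "Kettle", "price": 25.0})

# A new field appears later; no schema alteration step is needed
# and the earlier record is left untouched.
products.append({"sku": "B7", "name": "Lamp", "price": 40.0, "colour": "red"})

# The flexibility has a cost: readers must tolerate missing fields,
# for example by supplying a default value.
colours = [p.get("colour", "unknown") for p in products]
```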

14.5.4.2 Non-relational

NoSQL database can manage non-relational data efficiently along with relational data. The

relational constraints of RDBMS are not applicable in this database. This makes it easier to

manage non-relational data using NoSQL database.

14.5.4.3 Scalability

Relational databases are designed to scale vertically. But vertical scaling has its own

limitations, as it does not allow new servers to be introduced into the system to share the load.

Horizontal scalability is the only way to scale indefinitely, and it is usually also cheaper than vertical

scaling. NoSQL databases are designed to scale horizontally with minimum effort.

14.5.4.4 Auto-distribution

Distributed relational databases allow fragmentation and distribution of a database across

multiple servers. But that does not happen automatically; it is a manual process that has to be handled

by the application, which makes it difficult to manage. In NoSQL databases, on the other hand, distribution happens

automatically. Application developers need not worry about it.

All of the distribution and load balancing is automated in the database itself.

Distribution and replication of data segments are not inherent features of relational database;

these are responsibilities of application developers. In NoSQL, these happen automatically.

14.5.4.5 Auto-replication

Like fragmentation and distribution, replication of database fragments is also an

automatic process in NoSQL. No external programming is required to replicate fragments

across multiple servers. Replication ensures high availability of data and supports recovery.
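
How automatic replication supports availability can be sketched as follows. The node names, the replication factor of three and the `replicas_for`, `put` and `get` helpers are assumptions made for this illustration; the placement rule (consecutive nodes starting from the key's hash) is one simple possibility, not the scheme of any specific database.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
REPLICATION_FACTOR = 3

def replicas_for(key: str) -> list:
    """Pick REPLICATION_FACTOR consecutive nodes, starting at the key's hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

storage = {name: {} for name in NODES}  # per-node in-memory storage
down = set()                            # names of nodes that have failed

def put(key, value):
    """The store, not the application, writes every live replica."""
    for name in replicas_for(key):
        if name not in down:
            storage[name][key] = value

def get(key):
    """Return the value from the first live replica that holds the key."""
    for name in replicas_for(key):
        if name not in down and key in storage[name]:
            return storage[name][key]
    return None

put("order:42", "paid")
down.add(replicas_for("order:42")[0])   # the first replica fails
```

After the first replica fails, `get("order:42")` still finds the value on one of the remaining copies, which is the availability benefit described above.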

14.5.4.6 Integrated Caching

NoSQL database often provides integrated caching capability. This feature reduces latency

and increases throughput by keeping frequently-used data in system memory as much as

possible. In relational database, a separate caching layer needs to be maintained to achieve this

performance goal.
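
The effect of keeping frequently used data in memory can be shown with a small read-through cache. This is a sketch under assumptions: the `ReadThroughCache` class is hypothetical, a dictionary stands in for slow disk storage, and least-recently-used eviction is chosen only because it is simple to demonstrate.

```python
from collections import OrderedDict

class ReadThroughCache:
    def __init__(self, backing_store: dict, capacity: int = 2):
        self.backing_store = backing_store   # stands in for slow disk storage
        self.capacity = capacity
        self.cache = OrderedDict()           # ordered from least to most recent
        self.disk_reads = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as most recently used
            return self.cache[key]
        self.disk_reads += 1                 # cache miss: go to the slow store
        value = self.backing_store[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used key
        return value

db = ReadThroughCache({"a": 1, "b": 2, "c": 3})
for key in ("a", "a", "b", "a", "c", "b"):
    db.get(key)
```

Six reads are served with four accesses to the backing store; with a larger capacity or a more skewed access pattern the saving grows.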

It must be added, however, that although NoSQL databases offer many advantages

over relational databases, in some specific scenarios they fail to provide the rich reporting and analytical functionality

that an RDBMS offers.

Despite its many benefits, NoSQL fails in specific cases to provide the rich analytical functionality

that an RDBMS offers.

14.5.5 NoSQL Database Types

There are four different types of NoSQL databases. Each of them is designed to address the need of

some particular classes of problems. Various NoSQL database service providers try to offer solution

for different types of problems. Following sections describe four different NoSQL database types.

14.5.5.1 Key-Value Database

The Key-Value Database (or KV Store) is the simplest among the various NoSQL databases.

It pairs up data with a key and maintains the database like a hash-table where data values are

referred to by their keys. The main benefit of such pairing is that it makes the database easily scalable. However, it is not

suitable where queries are based on the value rather than on the key. Amazon’s DynamoDB

and Azure Table Storage are popular examples of this type of NoSQL database.
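
The hash-table behaviour of a key-value store, and its weakness for value-based queries, can be seen in a few lines. The snippet uses a plain Python dictionary as a stand-in for the store; the session keys and their contents are hypothetical.

```python
store = {}  # the key-value store: an in-memory hash table

store["session:9f2"] = {"user": "alice", "cart": ["book"]}
store["session:1c7"] = {"user": "bob", "cart": []}

# Lookup by key is a single hash-table probe.
alice_session = store["session:9f2"]

# Lookup by value has no index to use, so every entry must be scanned.
keys_for_bob = [k for k, v in store.items() if v["user"] == "bob"]
```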

14.5.5.2 Document-Oriented Database

A Document Oriented Database (or Document Store) is an application where data is stored in

documents. It is similar to the Key-Value stores with the values stored in structured documents.

The documents are addressed and can be retrieved from the database using key. This key can

be a path, a URI or a simple string.

The documents are schema-free and can be of any format as long as the database application

can understand their internal structure. Generally, document-oriented databases use

XML, JSON (JavaScript Object Notation) or Binary JSON (BSON) formats.

One document can be referred to by multiple keys and a document can refer to other

documents by storing their keys. But each document is treated as stand-alone and there is no

constraint to enforce relational integrity. MongoDB, Apache CouchDB and Couchbase are a few

popular examples of document-oriented databases.
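
A document store can be approximated as a mapping from keys to JSON documents. The sketch below is illustrative only: the path-style keys, the `author_key` field and the document contents are hypothetical, and a dictionary of JSON strings stands in for the database.

```python
import json

documents = {}  # key -> JSON text, standing in for a document store

documents["/authors/17"] = json.dumps({"name": "A. Writer"})
documents["/books/3"] = json.dumps({
    "title": "Cloud Basics",
    "author_key": "/authors/17",   # reference by key, not a foreign key
    "tags": ["cloud", "nosql"],
})

book = json.loads(documents["/books/3"])

# The store does not enforce that the referenced document exists, so the
# application follows the reference and handles a missing target itself.
author_doc = documents.get(book["author_key"])
author = json.loads(author_doc) if author_doc is not None else None
```

Deleting `/authors/17` would not be blocked by the store, which is what the absence of relational integrity means in practice.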

14.5.5.3 Column-Family Database

A Column-Family Database (or Wide-Column Data Store/Column Store) stores data grouped

in columns. Each column consists of three elements: a name, a value and a time-stamp. The name

is used to refer to the column and the time-stamp is used to identify the required version of the content. For

example, the time-stamp is useful in finding the most up-to-date content. Columns of a similar type

together form a column family, and they are often accessed together. A column family can

contain a virtually unlimited number of columns.
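
The three-part column (name, value, time-stamp) and the use of the time-stamp to find current content can be sketched as below. The `Column` type, the `profile` column family and the `latest` helper are hypothetical names chosen for the illustration.

```python
from typing import NamedTuple

class Column(NamedTuple):
    name: str
    value: str
    timestamp: int

# One row of a hypothetical "profile" column family; old versions are kept.
profile = [
    Column("email", "old@example.com", timestamp=100),
    Column("email", "new@example.com", timestamp=250),
    Column("city", "Pune", timestamp=120),
]

def latest(cells, name):
    """Return the value of the newest version of the named column, if any."""
    versions = [c for c in cells if c.name == name]
    if not versions:
        return None
    return max(versions, key=lambda c: c.timestamp).value
```

Here `latest(profile, "email")` picks the version written at time 250, while the older version remains available for reads that ask for earlier content.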

In relational databases, each row is typically stored as a contiguous disk entry. Different rows may

get stored in different places on disk. In contrast, in a column-family database, all of the

cells corresponding to a column are stored as a contiguous disk entry. This makes access

to the data of one column faster. For example, searching for a particular title among a million book records

stored in the relational data model is an intensive task, as it may cause millions of disk accesses.

Using the column-family data model, on the other hand, the titles are stored together, so the title can be found with far fewer disk

accesses.
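
The contrast between row-wise and column-wise layout can be sketched with in-memory lists. This only models the grouping of values, not actual disk behaviour, and the three book records are hypothetical sample data.

```python
# Row-oriented layout: all fields of one record are stored together.
rows = [
    {"id": 1, "title": "Dune", "year": 1965},
    {"id": 2, "title": "Emma", "year": 1815},
    {"id": 3, "title": "Ulysses", "year": 1922},
]

# Column-oriented layout: all values of one column are stored together.
columns = {
    "id": [r["id"] for r in rows],
    "title": [r["title"] for r in rows],
    "year": [r["year"] for r in rows],
}

# Searching titles touches one contiguous list instead of every record.
position = columns["title"].index("Emma")
book_id = columns["id"][position]
```

A search over `rows` would have to visit each record and skip the fields it does not need, whereas the column layout reads only the `title` values.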

The difference between column stores and key-value stores is that column stores are

optimized to handle data along columns. Column stores show better analytical power and

provide improved performance by imposing a certain amount of rigidity to a database schema.

In some ways, column stores are an intermediate solution between traditional RDBMSs and

key-value stores. Hadoop’s HBase is a popular example of a column store-based database.

14.5.5.4 Graph Database

In a Graph Database (or Graph Store), data is stored as graph structures using nodes and

edges. The entities are represented as nodes and the relationships between entities as edges.

Graph databases follow index-free adjacency, where every node directly points to its adjacent

nodes. In this set-up, the cost of a hop remains the same as the number of nodes increases.

This is useful for storing information about relationships when the number of elements is huge, such

as social connections. Twitter uses such a database to track who is following whom. Examples of

popular graph-based databases include Neo4j, InfoGrid, InfiniteGraph and a few others.
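
Index-free adjacency can be approximated by letting every node hold a direct list of its neighbours. The sketch below is a simplified illustration: the `follows` graph and its user names are hypothetical, and a breadth-first search is used to count the hops between two users.

```python
from collections import deque

# Each node holds direct references to its neighbours (who it follows).
follows = {
    "ana": ["ben", "cho"],
    "ben": ["cho"],
    "cho": ["dev"],
    "dev": [],
}

def hops(start: str, target: str):
    """Breadth-first search: number of 'follows' edges from start to target,
    or None when the target cannot be reached."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == target:
            return distance
        for neighbour in follows[node]:   # one step per edge, no global index
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None
```

Each hop only inspects the neighbour list of the current node, so its cost depends on that node's connections rather than on the total size of the graph.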

14.5.6 Selecting the Suitable NoSQL Database Solution

Each of the NoSQL database types has its own strengths and weaknesses. They are designed to

serve different kinds of data storage requirements and hence are not directly comparable with each other.

The ‘one-size-fits-all’ philosophy of relational databases is not applicable in the NoSQL database

domain. Here, users have the flexibility of choosing from multiple options after analyzing the

requirements of their applications.

Selecting a NoSQL database strategy is not a one-time decision. First, one has

to identify the requirements of an application which are not met by relational database

systems. Then a suitable NoSQL database solution has to be identified to meet those unfulfilled

requirements. More than one NoSQL database type may even be used to meet all of these

necessities.

Sometimes a single application performs best when more than one

NoSQL database type is employed together. In such cases, a multi-model NoSQL database can be

used, which is designed to support several of the four primary NoSQL data models.

The days when one DBMS was used to fit all needs are over. Now, a single application may use

several different data stores at the back-end.

14.5.7 Commercial NoSQL Databases

Commercial NoSQL databases started surfacing two years after the publication of Google’s

paper on BigTable in 2006 and Amazon’s paper on Dynamo in 2007. After these publications,

many initiatives were taken up for both open-source and closed-source development of NoSQL

databases. By the end of 2009, there were several releases including BigTable-inspired HBase,

Dynamo-inspired Riak and Cassandra. The following sections briefly describe some of the popular

NoSQL databases.

14.5.7.1 Apache’s HBase

HBase is an open-source NoSQL database system written in Java. It was developed by the Apache

Software Foundation as part of its Hadoop project. HBase’s design architecture has been

inspired by Google’s internal storage system BigTable. As Google’s BigTable uses GFS, Hadoop’s

HBase uses HDFS as underlying file system. HBase is a column-oriented database management

system.

14.5.7.2 Amazon’s DynamoDB

DynamoDB is a key-value NoSQL database developed by Amazon. It derives its name from

Dynamo, Amazon’s internal storage system, and was launched in 2012. The database

service is fully managed by Amazon and offered as part of the Amazon Web Services portfolio.

DynamoDB is particularly useful for supporting a large volume of concurrent updates and is well

suited to shopping-cart-like operations.

14.5.7.3 Apache’s Cassandra

Cassandra is an open-source NoSQL database management system developed in Java. It was

initially developed at Facebook and was then released as an open-source project in 2008 with

the goal of further advancement. Although Facebook depended heavily on

Cassandra, it still released the system as an open-source project, possibly confident that it was

already too late for others to use the technology to challenge its position. Cassandra became an Apache

Incubator project in 2009. Cassandra is a hybrid of column-oriented and key-value data stores and

is suitable for deployment across many commodity servers as well as on cloud infrastructure.

14.5.7.4 Google Cloud Datastore

Cloud Datastore is developed by Google and is available as a fully-managed NoSQL database

service. Cloud Datastore is easy to use and supports SQL-like queries through a language called GQL.

The Datastore is a NoSQL key-value database where users can store data as key-value pairs.

Cloud Datastore also supports ACID transactions using optimistic concurrency control.

14.5.7.5 MongoDB

MongoDB is a popular document-oriented open-source NoSQL database. It is developed by

New York City-based MongoDB Inc. and was first released as a product in 2009. It is written

in the C++, JavaScript and C programming languages and provides GridFS, a built-in mechanism for storing large files.

MongoDB runs well on many cloud-based environments including Amazon EC2.

14.5.7.6 Amazon’s SimpleDB

SimpleDB is a fully-managed NoSQL data store offered by Amazon. It is a key-value store and

not actually a full database implementation. SimpleDB was first announced in December 2007

and works with both Amazon EC2 and Amazon S3.

14.5.7.7 Apache’s CouchDB

CouchDB is an open-source document-oriented NoSQL database. CouchDB was first developed

in 2005 by a former IBM developer. Later, in 2008, it was adopted as an Apache Incubator

project. In 2010, the first stable version of CouchDB was released and it became popular.

14.5.7.8 Neo4j

Neo4j is an open-source graph database developed in Java. Neo4j was developed by Neo

Technology of the United States and was initially released in 2007. Its stable versions started

appearing from 2010.

Apart from the few NoSQL databases mentioned above, there are numerous other

products available in the market. New developments are happening in this field and many

new products are being launched. Database management giants such as Oracle have also

launched their own NoSQL databases. Hence, many more advancements are expected in

this domain in the coming years.
