Course 441: Introduction to NoSQL Databases

Curriculum

Big Data and Machine Learning

Delivery methods

On-Site, Virtual

Duration

4 days

The growth of the internet has brought with it the phenomenon of Big Data: massive quantities of rapidly evolving, unstructured information. The need to process and store this information in a timely and cost-effective way has led to the adoption of the computer cluster as the infrastructure of choice. That adoption has, in turn, given greater impetus to the development of distributed systems that take full advantage of this infrastructure. Apache Spark is an example of such a distributed system for data processing. This course is about distributed persistence technologies, focusing on NoSQL databases and their query languages.

Because Big Data usually means unstructured data, NoSQL databases do not attempt to impose a relational model. Instead, four different data models have distinguished themselves in the NoSQL ecosystem: key-value, document, column-family, and graph.
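
To make the four data models concrete, the sketch below shapes the same customer record for each of them. It uses plain Java collections only (Java 16 or later, for the record type); no database client is involved, and the class name, keys, and sample values are all hypothetical. Real products differ in the details, so treat this as a rough mental picture rather than any vendor's actual storage format.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: one customer's data shaped for each of the four NoSQL
// data models, using plain collections rather than a real database client.
class DataModelSketch {

    // Key-value: an opaque value looked up by a single key.
    static Map<String, String> keyValue() {
        return Map.of("customer:42", "Ada Lovelace");
    }

    // Document: a self-describing, nested structure stored as one unit.
    static Map<String, Object> document() {
        return Map.of(
            "_id", 42,
            "name", "Ada Lovelace",
            "orders", List.of(Map.of("sku", "A-100", "qty", 2)));
    }

    // Column-family: row key -> column family -> column name -> value.
    static Map<String, Map<String, Map<String, String>>> columnFamily() {
        return Map.of("row-42", Map.of(
            "profile", Map.of("name", "Ada Lovelace"),
            "orders", Map.of("A-100", "2")));
    }

    // Graph: nodes joined by named, directed relationships.
    record Edge(String from, String type, String to) {}

    static List<Edge> graph() {
        return List.of(new Edge("customer:42", "PURCHASED", "product:A-100"));
    }

    public static void main(String[] args) {
        System.out.println(keyValue().get("customer:42"));
        System.out.println(document().get("orders"));
        System.out.println(columnFamily().get("row-42").get("orders"));
        System.out.println(graph().get(0));
    }
}
```

The shapes differ mainly in how much structure the store itself can see: none for key-value, nested fields for document, grouped columns for column-family, and explicit relationships for graph.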

Independent of the four data models, the NoSQL databases distinguish themselves in their approach to leveraging the cluster. At a high level, these differences can be understood in terms of the CAP theorem.

The Hadoop Distributed File System (HDFS) is the first persistence technology presented in the course. Well known for its role in Hadoop MapReduce, HDFS is also used directly by many Big Data and NoSQL applications, including Apache Spark, Pig, Hive, and HBase. Three of these technologies are described: Pig for data mining, Hive for data warehousing, and HBase as a NoSQL column-family store.

The course then presents representative NoSQL databases for each of the four previously mentioned data models: MongoDB for document, Cassandra for column-family, Neo4j for graph, and Redis for key-value.

Before getting into the details of each database, the course reviews the relational model, identifies the forces introduced by the cluster (such as the degradation of consistency or availability), and examines the CAP theorem.

For each particular NoSQL implementation, its architecture is described and positioned via the CAP theorem. Common use cases are presented and the API demonstrated. Specific approaches for achieving scalability are identified.

Learning objectives

After successfully completing this course, students will be able to:

Distinguish the different types of NoSQL databases
Understand the impact of the cluster on database design
State the CAP theorem and explain its main points
Explain where HBase, MongoDB, Cassandra, Neo4j, and Redis fit with the CAP theorem
Work with the Hadoop Distributed File System (HDFS) as a foundation for NoSQL technologies
Warehouse HDFS data using Apache Hive
Data mine HDFS data with Apache Spark-SQL and Apache Pig
Describe the design of HBase, MongoDB, Cassandra, Neo4j, and Redis
Use the data control, definition, and manipulation languages of the NoSQL databases covered in the course

Hands-on labs

The hands-on labs are a key learning element of this course. Each lab reinforces the material presented in lecture.

Prerequisites

The ability to read Java is necessary for many of the examples presented in the notes. The exercises require the ability to make slight adaptations of code presented in the notes. Note that all the NoSQL databases presented in this course publish client libraries in multiple languages. Java is one client language that is common to all the NoSQL databases used here.

When learning about NoSQL, it is helpful to be able to make comparisons with relational technology. Although the latter is reviewed in the course, it will be helpful if the participant has some prior experience with the relational model.

Course outline

1. An Overview of NoSQL

Review of the Relational Model
ACID Properties
Distributed Databases: Sharding and Replication
Consistency
The CAP Theorem
NoSQL Data Models

2. HDFS

Overview of HDFS
HDFS Deployment
Core HDFS Services
Checkpointing
Federated and High Availability HDFS
Multi-node Cluster with Docker

3. Apache Hive as an HDFS Data Warehouse

What Is Hive?
Hive Metastore and HiveServer2
The Beeline Command-Line Interface
Creating Hive Internal and External Tables
Data Serialization and Deserialization (SerDes)
Hive Storage Formats including Avro, Sequence File, and Parquet
Hive Query Language (HQL)
Built-in and User-Defined Functions
Hive and MapReduce
Partitions and Buckets
Mining Hive Data with Apache Pig and Apache Spark-SQL

4. HBase

Configuring HBase
Data Model: Conceptual and Physical Views
Data Model Operations
Schema Creation
Row Key Design
Architecture Overview
HBase Shell

5. MongoDB

The Document Data Model
Documents and Collections
MongoDB Use Cases
Embedded Data Models
Normalized Data
Replication via Replica Sets
MongoDB Design
MongoDB and the CAP Theorem
The MongoDB Data Manipulation Language
Transactions, Atomicity, and Documents
Durability and Journaling
Batch Processing and Aggregation
Indexing
Auto-Sharding, Shard Keys, and Horizontal Scalability
Writing to Shards
MongoDB as a File System

6. Cassandra

The Column-Family Data Model
Databases and Tables
Columns, Types, and Keys
The Data Manipulation Language
Cassandra’s Architecture
Keyspaces, Replication, and Column-Families
The CAP Theorem
Consistent Hashing
Managing Cluster Nodes
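
As a rough illustration of the consistent hashing topic listed above, the following is a minimal, generic sketch of a hash ring with virtual nodes. It is not Cassandra's implementation: the class and method names are hypothetical, and MD5 is used only because it ships with the JDK. The idea shown is that each key is owned by the first node position at or after the key's hash on the ring, so adding or removing a node only moves the keys adjacent to that node's positions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a consistent hash ring; not taken from any database.
class ConsistentHashRing {
    // Ring position -> owning node, kept sorted by position.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // Place several virtual nodes per physical node to even out the load.
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    // Remove only the positions that still belong to this node.
    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i), node);
        }
    }

    // Walk clockwise from the key's hash to the next node position,
    // wrapping around to the first position at the end of the ring.
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // First 8 bytes of the MD5 digest, interpreted as a signed 64-bit value.
    static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(64);
        ring.addNode("node-a");
        ring.addNode("node-b");
        ring.addNode("node-c");
        System.out.println("user:1001 is owned by " + ring.nodeFor("user:1001"));
    }
}
```

With this layout, removing a node reassigns only the keys that node owned; every other key keeps its owner, which is the property that makes the technique attractive for managing cluster membership.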

7. Neo4j

Overview of Graph Theory
The Graph Data Model
Relationships as First-Class Citizens
Graph Database Use Cases
Neo4j Design: Standalone and Cluster
ACID Properties and the CAP Theorem
Transaction Management with JTA
CRUD Operations with the Neo4j Core API
Navigating Graphs with the Traversal API
The Neo4j REST API
The Cypher Data Manipulation Language
Querying as Graph Traversal

8. Redis

The Key-Value Data Model
Redis as a Cache
Commands and Pipelining
Durability/Persistence Mechanisms
Partitioning with Redis Cluster
Publish/Subscribe Messaging
Keyspace Notifications
Automatic Deletion with Key Expiration
Bulk Data Loading
Transactions
