Course 441:
Introduction to NoSQL Databases

(4 days)

 

Course Description

The growth of the internet has brought with it the phenomenon of Big Data: massive quantities of rapidly evolving, unstructured information. The need to process and store this information in a timely and cost-effective way has led to the adoption of the computer cluster as the infrastructure of choice. That adoption has, in turn, given greater impetus to the development of distributed systems that take full advantage of this infrastructure. Apache Spark is an example of such a distributed system for data processing. This course is about distributed persistence technologies, focusing on NoSQL databases and their query languages.

To better accommodate the fact that Big Data usually means unstructured data, NoSQL databases do not attempt to impose a relational model. In fact, four different data models have distinguished themselves in the NoSQL ecosystem: key-value, document, column-family, and graph.

Independent of the four data models, the NoSQL databases distinguish themselves in their approach to leveraging the cluster. At a high level, these differences can be understood in terms of the CAP theorem.

The Hadoop Distributed File System (HDFS) is the first persistence technology presented in the course. Well known for its role in Hadoop MapReduce, HDFS is also used directly by many Big Data and NoSQL applications, including Apache Spark, Pig, Hive, and HBase. Three of these technologies are described: Pig for data mining, Hive for data warehousing, and HBase as a NoSQL column-family store.

The course then presents representative NoSQL databases for each of the four previously mentioned data models: MongoDB for document, Cassandra for column-family, Neo4j for graph, and Redis for key-value.

Before getting into the details of each database, the course reviews the relational model, identifies the forces introduced by the cluster (such as the degradation of consistency or availability), and examines the CAP theorem.

For each particular NoSQL implementation, its architecture is described and positioned via the CAP theorem. Common use cases are presented and the API demonstrated. Specific approaches for achieving scalability are identified.

Learning Objectives

After successfully completing this course, students will be able to:

  • Distinguish the different types of NoSQL databases
  • Understand the impact of the cluster on database design
  • State the CAP theorem and explain its main points
  • Explain where HBase, MongoDB, Cassandra, Neo4j, and Redis fit with the CAP theorem
  • Work with the Hadoop Distributed File System (HDFS) as a foundation for NoSQL technologies
  • Warehouse HDFS data using Apache Hive
  • Mine HDFS data with Apache Spark SQL and Apache Pig
  • Describe the design of HBase, MongoDB, Cassandra, Neo4j, and Redis
  • Use the data control, definition, and manipulation languages of the NoSQL databases covered in the course

Hands-On Labs

The hands-on labs are a key learning element of this course. Each lab reinforces the material presented in lecture.

Prerequisites

The ability to read Java is necessary for many of the examples presented in the notes. The exercises require the ability to make slight adaptations of code presented in the notes. Note that all the NoSQL databases presented in this course publish client libraries in multiple languages. Java is one client language that is common to all the NoSQL databases used here.

When learning about NoSQL, it is helpful to be able to make comparisons with relational technology. Although the latter is reviewed in the course, it will be helpful if the participant has some prior experience with the relational model.


Course Outline

1. An Overview of NoSQL (1 hour)

  • Review of the Relational Model
  • ACID Properties
  • Distributed Databases: Sharding and Replication
  • Consistency
  • The CAP Theorem
  • NoSQL Data Models
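
As a small illustration of how replication and consistency interact, the Java sketch below checks the quorum-overlap condition used by some replicated stores: with N replicas, a write acknowledged by W replicas and a read that consults R replicas must share at least one replica whenever R + W > N. This is a sketch under stated assumptions rather than course material; the class and method names (QuorumCheck, quorumsOverlap) are hypothetical and do not belong to any database's client API.

```java
// Hypothetical helper for illustration; not part of any database client library.
class QuorumCheck {

    /**
     * Returns true when any set of w replicas and any set of r replicas,
     * both drawn from the same n replicas, must share at least one member.
     */
    static boolean quorumsOverlap(int n, int w, int r) {
        if (n < 1 || w < 1 || r < 1 || w > n || r > n) {
            throw new IllegalArgumentException("require 1 <= w, r <= n");
        }
        return r + w > n;
    }

    public static void main(String[] args) {
        // Three replicas, write to two, read from two: the sets must overlap.
        System.out.println(quorumsOverlap(3, 2, 2)); // prints true
        // Three replicas, write to one, read from one: a read can miss the write.
        System.out.println(quorumsOverlap(3, 1, 1)); // prints false
    }
}
```

Choosing smaller values of W and R tolerates more unavailable replicas at the cost of possibly stale reads, which is one concrete form of the consistency versus availability trade-off.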

2. HDFS (3 hours)

  • Overview of HDFS
  • HDFS Deployment
  • Core HDFS Services
  • Checkpointing
  • Federated and High Availability HDFS
  • Multi-node Cluster with Docker

3. Apache Hive as an HDFS Data Warehouse (5 hours)

  • What Is Hive?
  • Hive Metastore and HiveServer2
  • The Beeline Command-Line Interface
  • Creating Hive Internal and External Tables
  • Data Serialization and Deserialization (SerDes)
  • Hive Storage Formats including Avro, Sequence File, and Parquet
  • Hive Query Language (HQL)
  • Built-in and User-Defined Functions
  • Hive and MapReduce
  • Partitions and Buckets
  • Mining Hive Data with Apache Pig and Apache Spark SQL

4. HBase (4 hours)

  • Configuring HBase
  • Data Model: Conceptual and Physical Views
  • Data Model Operations
  • Schema Creation
  • Row Key Design
  • Architecture Overview
  • HBase Shell

5. MongoDB (5 hours)

  • The Document Data Model
  • Documents and Collections
  • MongoDB Use Cases
  • Embedded Data Models
  • Normalized Data
  • Replication via Replica Sets
  • MongoDB Design
  • MongoDB and the CAP Theorem
  • The MongoDB Data Manipulation Language
  • Transactions, Atomicity, and Documents
  • Durability and Journaling
  • Batch Processing and Aggregation
  • Indexing
  • Auto-Sharding, Shard Keys, and Horizontal Scalability
  • Writing to Shards
  • MongoDB as a File System

6. Cassandra (6 hours)

  • The Column-Family Data Model
  • Databases and Tables
  • Columns, Types, and Keys
  • The Data Manipulation Language
  • Cassandra’s Architecture
  • Keyspaces, Replication, and Column-Families
  • The CAP Theorem
  • Consistent Hashing
  • Managing Cluster Nodes
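
As a rough illustration of the consistent hashing topic listed above, the following Java sketch places node tokens and keys on a hash ring and assigns each key to the first node token at or after the key's hash, wrapping around at the end of the ring. It is a toy under stated assumptions, not Cassandra's actual partitioner: the class name ConsistentHashRing, the MD5-derived 64-bit hash, and the "#i" virtual-node naming are all hypothetical choices made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.TreeMap;

// Illustrative toy ring; names and hashing scheme are assumptions, not Cassandra's.
class ConsistentHashRing {
    // Maps each token (a position on the ring) to the node that owns it.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // Each physical node contributes several tokens to spread its load.
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    // The owner is the first token at or after the key's hash, wrapping to the start.
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Long token = ring.ceilingKey(hash(key));
        return ring.get(token != null ? token : ring.firstKey());
    }

    // Uses the first 8 bytes of an MD5 digest as a 64-bit ring position.
    static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing r = new ConsistentHashRing(16);
        r.addNode("node-a");
        r.addNode("node-b");
        r.addNode("node-c");
        System.out.println("user:42 is owned by " + r.nodeFor("user:42"));
    }
}
```

The property worth noticing is that removing a node reassigns only the keys that node owned; keys owned by the remaining nodes keep their placement, which is what makes adding and removing cluster nodes inexpensive compared with rehashing every key.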

7. Neo4j (2.5 hours)

  • Overview of Graph Theory
  • The Graph Data Model
  • Relationships as First-Class Citizens
  • Graph Database Use Cases
  • Neo4j Design: Standalone and Cluster
  • ACID Properties and the CAP Theorem
  • Transaction Management with JTA
  • CRUD Operations with the Neo4j Core API
  • Navigating Graphs with the Traversal API
  • The Neo4j REST API
  • The Cypher Data Manipulation Language
  • Querying as Graph Traversal

8. Redis (Optional) (2.5 hours)

  • The Key-Value Data Model
  • Redis as a Cache
  • Commands and Pipelining
  • Durability/Persistence Mechanisms
  • Partitioning with Redis Cluster
  • Publish/Subscribe Messaging
  • Keyspace Notifications
  • Automatic Deletion with Key Expiration
  • Bulk Data Loading
  • Transactions

Please Contact Your ROI Representative to Discuss Course Tailoring!