Course 401:
Big Data Essentials for Managers & Non-Programmers

(2 days)

Course Description

In recent years, enterprises have seen unprecedented increases in the amount and variety of data captured as a byproduct of daily operations. Emerging technologies now enable managers to exploit this data to gain previously unattainable insights into their operations and customers, and to identify new business opportunities. This course surveys these opportunities and technologies to help managers focus their big data management and analysis efforts.

The course provides hands-on exposure to some of the key tools and includes workshops that investigate how big data technology can be applied to different business problems and data types.

Learning Objectives

  • Understand the definition and significance of big data and big data analytics in the enterprise
  • Discuss the architecture, operation, and business benefits of a big data solution
  • Examine the potential business opportunities that big data capabilities can uncover
  • Explore the relationship between cloud computing and big data
  • Identify how big data solutions differ from traditional data solutions
  • Gain insight into emerging management and analytical tools for big data, and compare them with traditional data manipulation tools

Who Should Attend

This course is intended for executives, project managers at all levels of experience, and software practitioners who want to understand the concepts and technologies involved in exploiting big data in the enterprise, along with the related opportunities and constraints.

Prerequisites

No specific prerequisites are required. Basic UNIX/Linux command line skills are helpful, but hands-on labs and workshops are fully guided. Programming experience is not required.

Hands-On Exercises and Workshops

  • Workshop: Finding Big Data Sources, Opportunities, and Challenges in Your Enterprise
  • Hands-On: Using a Key-Value Datastore
  • Hands-On: Using a Document Datastore
  • Hands-On: Examining an Application That Uses a Graph Datastore
  • Hands-On: Column-Oriented Data Storage
  • Hands-On: Using a Distributed File System
  • Workshop: Looking at Your Data Differently
  • Hands-On: Running a MapReduce Job
  • Hands-On: Extracting Information from Semi-Structured Data
  • Hands-On: Extracting Di-grams from Text Input
  • Hands-On: Data Aggregation and Tabulation
  • Workshop: Processing Data in New Ways
  • Hands-On: Preparing Your Data Set for Analysis
  • Hands-On: Extracting Sentiment from a Data Set with Pig
  • Hands-On: Querying Your Data Set for Intensity
  • Hands-On: Using HiveQL
  • Hands-On: Performing Ad-hoc Queries against a Data Warehouse with Hive
  • Workshop: Analyzing/Mining Your Data
  • Hands-On: Understand and Use a Recommender Engine
  • Hands-On: Seeing What R Can Do
  • Hands-On: Use R to Process and Visualize Data
  • Workshop: Data Analytics for the Enterprise
  • Hands-On: Examine Data Sources and Identify Useful Data
  • Workshop: Solving a Problem with Big Data Tools

Course Outline

Chapter 1: Introduction to Big Data

  • What Is Big Data?
    • The Technical Challenges Posed by Big Data
    • Common Big Data Use Cases
    • Structured, Semi-Structured, and Unstructured Data
    • Transforming Data to Information to Business Value
    • Big Data Stack
  • Working with Big Data
    • Locating/Extraction
    • Processing
    • Storage
    • Analysis/Interpretation
    • Representation
  • Tools of Big Data
    • Distributed File Systems
    • NoSQL Storage
    • Compute Engines
    • Distributed Infrastructures
    • Analysis Engines
    • Graphical Representation
  • Cloud Computing
    • Definition
    • Relationship to Big Data
    • Characteristics
  • Business Opportunities in Big Data
    • Investment, Costs, and Benefits
    • Skill Sets Needed (Analytics, Programming, Systems Management, Presentation)
    • Data Characteristics (Sources, What to Keep, How Long)
  • Use Case Workshop: Finding Big Data Sources, Opportunities, and Challenges in Your Enterprise

Chapter 2: Storing Data

  • Traditional Data Storage
    • Defined
    • Files and Folders
    • File Structure Internals (ASCII/Binary)
    • Flat Files (Fixed-Width, CSV/Tab-Delimited, XML)
    • Structured Data Storage
    • Row-Oriented Storage
    • Relational Databases
    • Reaching Limitations
  • The New Storage – Not Only SQL (NoSQL)
  • Why Do We Need NoSQL Datastores?
  • Key-Value Datastores
  • Hands-On: Using a Key-Value Datastore
  • Document Datastores
  • Hands-On: Using a Document Datastore
  • Graph Datastores
  • Hands-On: Examining an Application That Uses a Graph Datastore
  • Column-Oriented Data Storage
  • Why Do I Need Column-Oriented Storage?
  • Capabilities and Limitations
  • Querying Data from a Column-Oriented Datastore
  • Partitioning and Sharding
  • Hands-On: Column-Oriented Data Storage
  • Distributed File Systems
  • How Much Data Can Be Stored?
    • Traditional File Server
    • Enterprise Storage (NAS)
    • Cloud Storage
    • Distributed File Systems
  • Scaling Structured Data Storage
  • Hands-On: Using a Distributed File System
  • Real-World Examples That Use NoSQL Data Stores and Distributed File Systems
  • Use Case Workshop: Looking at Your Data Differently

Chapter 3: Compute Engines

  • Compute Engines
    • Common Enterprise Architectures
      • Distributed Systems and Service-Oriented Architectures
      • Client-Server, Middleware, and Clustering
    • Scaling Horizontally vs. Scaling Vertically
      • Bigger Single-Systems
      • Parallelization/Grid-Computing
    • What Is a Compute Engine?
      • Distributed Processing
      • Dynamic Resource Allocation
      • Building vs. Buying
    • MapReduce
      • Patterns
      • Algorithms
      • Use Cases
    • Hands-On: Running a MapReduce Job
  • Hadoop Introduction
  • Introduction to Hadoop and YARN
  • Hadoop Distributed File System
  • Processing Data with Hadoop
  • Mapping and Reducing Data
  • Partitioning and Mapping
  • Streaming Data
  • Common Hadoop Tasks and Tools
    • Sequential Processing
    • Extracting and Transforming Large Data Sets
    • Basic UNIX/Linux Commands (grep, wc, awk, uniq, sort)
    • Search, Count, Tabulate
    • Hands-On: Extracting Information from Semi-Structured Data
    • Di-gram / Word-Pair Extraction
    • Hands-On: Extracting Di-grams from Text Input
    • Data Aggregation
    • Hands-On: Data Aggregation and Tabulation
    • Beyond Text: Image/Sound Processing
  • Common Hadoop Tools
    • Sequential Processing
    • Leverage Scripting: Perl/Python
    • How Programmers (Java) Further Extend Hadoop
  • Real-World Examples That Use Hadoop and Hadoop Distributed File System
  • Use Case Workshop: Processing Data in New Ways

Chapter 4: Analyzing Large Data Sets

  • Process of Analyzing Big Data Sets
    • MapReduce
    • Analysis
    • Creating MapReduce Jobs with Higher Level Languages
    • Hands-On: Preparing Your Data Set for Analysis
  • Pig
    • What Is Pig and How Does It Relate to Hadoop?
    • Simplify the Data Flow
  • Pre-processing the Data
  • Using Grunt (the interactive shell for Pig)
  • Loading and ForEach/Generating
  • Hands-On: Extracting Sentiment from a Data Set with Pig
  • Understanding a Pig Script
    • Loading
    • ForEach / Generate
    • Dump and Limit
    • Filtering, Grouping, and Ordering
  • Storing Results
  • Hands-On: Querying Your Data Set for Intensity
  • Hive
    • What Is Hive and How Does It Relate to Hadoop?
    • Creating the Data Warehouse
    • Projecting Structure on Stored Data
    • Hive Query Language (HiveQL) Basics
    • Hands-On: Using HiveQL
    • Batch Processing vs. Real-Time Queries
    • Performing Ad-hoc Queries
      • Functions
      • Working Around Limitations
    • Data Warehouse Concepts
      • Partitioning
      • Sampling / Buckets
    • Hands-On: Performing Ad-hoc Queries against a Data Warehouse
  • Real-World Examples That Use Pig and Hive
  • Use Case Workshop: Analyzing/Mining Your Data

Chapter 5: Data Analytics and Visualization

  • Leveraging Big Data within Applications
    • Clustering / Grouping Data That Shares Similarities
    • Classification / Categorizing Data
    • Recommender Engines
  • Mahout
    • What Is Mahout?
    • Integrating within an Application
    • Hands-On: Understand and Use a Recommender Engine
  • R – Statistical Programming and Visualization Language
    • What Is R and How Does It Apply to Big Data?
    • Performing Analysis with R
    • How Do R and Hadoop Integrate (RHadoop)?
    • Visualizing Data with R
    • Hands-On: Seeing What R Can Do
    • Integrating within an Application
    • Hands-On: Use R to Process and Visualize Data
  • Real-World Examples of Data Analytics
  • Use Case Workshop: Data Analytics for the Enterprise

Chapter 6: Bringing It All Together

  • Review of Big Data Layers
  • Architecting a Big Data Solution (Data Lake)
    • Architecture Data Sources
    • Transformation and Storage of Data
    • Processing and Analysis of Data
    • Consumption of Data
  • Finding Sources of Data
    • Traditional Sources
    • Document Data and Meta-Data
    • Enterprise Systems
    • Logs and Events
    • Temporal Data
    • Hands-On: Examine Data Sources and Identify Useful Data
  • Understanding and Expanding the Big Data Ecosystem Tools (high-level)
    • Batch Processing: Pig, Hive
    • Interactive Querying: Impala, Tez, and HAWQ
    • Stream Processing: Storm, Spark
    • Search: Solr, Elasticsearch
    • Machine Learning: Mahout, R
    • Infrastructure: Oozie, Flume, Sqoop, ZooKeeper, Sentry, HCatalog
    • Building Your Own Solutions
  • Commercial Cloud Solutions Overview
    • Google Big Data Solutions
    • Amazon Big Data Solutions
  • Final Use Case Workshop: Solving a Problem with Big Data Technologies

Please contact your ROI representative to discuss course tailoring.