Course 649:
Foundations of Data Scraping

(3 days)

 

Course Description

The Internet is awash with data. Some of it is easy to get; some of it looks impossible to reach. It doesn’t have to be difficult. In this course, students work through procedures for pulling data from almost any website and preparing that data for analysis. The procedures are written in Python so that they are easy to extend.

Learning Objectives

  • Understanding of HTML
  • Working knowledge of HTTP
  • Background in REST technologies
  • Data scraping with Beautiful Soup
  • Using Selenium to access data rendered by client-side scripts
  • Scraping data from log files
  • Python string manipulation
  • Python Pandas for data IO
  • Python Pandas for data preparation

Prerequisites

Basic knowledge of Python is required.

Who Should Attend

This course is aimed at those who need to access data that is available only from a web page.


Course Outline

Necessary Background

  • Requests Module and the HTTP Protocol
  • Short Introduction to REST Design
  • HTML Basics
  • Short Introduction to CSS
  • Working with Jupyter Notebooks

When Scraping Is Not Required

  • Introduction to Pandas IO
  • Downloading Different File Formats: CSV, JSON
  • Simple Data Conversions

Hands-On: Displaying a Web Table

  • Accessing a Web Page
  • Details on Chrome's Developer Tools
  • Scraping with Chrome

Extracting Data

  • Using the Requests Module
  • Using the Beautiful Soup Module

Necessary Data Manipulation

  • Creating Lists from Beautiful Soup
  • Creating DataFrames from Beautiful Soup
  • Creating .csv Files from DataFrames
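
As a rough illustration of the list-building and .csv steps above, the sketch below turns an HTML table into rows and then into CSV text. The course itself uses Beautiful Soup and Pandas for this; the sketch swaps in the standard library's html.parser and csv modules so that it runs without extra installs. The sample HTML, the class name TableParser, and the function name table_to_csv are assumptions made for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>30000</td></tr>
  <tr><td>Shelbyville</td><td>25000</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, if any
        self._cell = None  # text pieces of the cell being built, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return all table rows found in html as CSV text."""
    parser = TableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()
```

Calling table_to_csv(SAMPLE_HTML) returns the header row followed by one line per data row; writing that string to a file gives the .csv file.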

Hands-On: Extracting Tables from Websites

  • Complete Project, Covering
    • Looking at Website
    • Extracting Data
    • Creating .csv Files

Using Selenium

  • Client-Side Scripting
  • What Selenium Does
  • A Step Down Process
  • Demonstration: Extracting Data

Hands-On Project: Extracting Table Using Selenium

  • Creating Procedure for Extracting Table
  • Using and Documenting Procedure for Extracting Table

Extracting Data From Log Files

  • Review of Regular Expressions
  • Handling Date-Time Fields
  • Demonstration: Extraction from Log File
  • Hands-On: Extraction from Log File
  • Using Pandas to Extract Data
  • Demonstration: Extraction from Log File Using Pandas
  • Hands-On: Extraction from Log File Using Pandas
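
As a rough illustration of the regular-expression and date-time topics above, the sketch below parses a single log line. It assumes an Apache-style common log format; the sample line, the pattern, and the function name parse_line are made up for this example and are not taken from the course material.

```python
import re
from datetime import datetime

# Hypothetical sample line in Apache-style common log format.
SAMPLE = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'

# Named groups make the extracted fields self-describing.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the timestamp text into a timezone-aware datetime.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the size field means no body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record
```

A list of such dicts, one per matching line, can then be handed to Pandas for further preparation.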

Extracting Table from a Character-based PDF

  • Overview of Current Methods
  • Demonstration of Extracting Data from Character-based PDF
  • Hands-On: Procedure for Extracting Table

Extracting Table from a Non-Character-based PDF

  • Overview of Current Methods
  • Demonstration of Extracting Data from Non-Character-based PDF
  • Hands-On: Procedure for Extracting Table

Making a Standalone Program

  • Command Line Parameters
  • Converting Jupyter Notebook to a Python Script
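
As a rough illustration of command line parameters, the sketch below shows one way a converted notebook might accept its inputs through the standard library's argparse module. The argument names (url, --output), the default file name, and the placeholder body of main are assumptions made for this example.

```python
import argparse


def build_parser():
    """Describe the parameters the script accepts."""
    parser = argparse.ArgumentParser(
        description="Scrape a table and save it as CSV."
    )
    # Hypothetical parameters: the page to read and the file to write.
    parser.add_argument("url", help="page to scrape")
    parser.add_argument(
        "-o", "--output", default="table.csv", help="CSV file to write"
    )
    return parser


def main(argv=None):
    """Parse argv (or sys.argv when argv is None) and report the plan."""
    args = build_parser().parse_args(argv)
    # A real script would call the scraping procedure here.
    print(f"Would scrape {args.url} into {args.output}")
    return args


# In a script, finish with:
#     if __name__ == "__main__":
#         main()
```

Passing a list to main, as in main(["https://example.com", "-o", "out.csv"]), makes the same code easy to try from a notebook cell before it is run from the command line.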

Please Contact Your ROI Representative to Discuss Course Tailoring!