Course 649:
Foundations of Data Scraping
(3 days)
Course Description
The Internet is awash with data. Sometimes it is easy to get; sometimes it looks impossible. It doesn’t have to be difficult. In this course, students work through procedures for pulling data from almost any website and preparing that data for analysis. The procedures are written in Python so that they are easy to extend.
Learning Objectives
- Understanding of HTML
- Working knowledge of HTTP
- Background in REST technologies
- Data scraping with Beautiful Soup
- Using Selenium to access non-HTML data
- Data scraping from log files
- Python string manipulation
- Python Pandas for data IO
- Python Pandas for data preparation
Prerequisites
Basic knowledge of Python is required.
Who Should Attend
This course is aimed at those who need to access data that is available only from a web page.
Course Outline
Necessary Background
- Requests Module and the HTTP Protocol
- Short Introduction to REST Design
- HTML Basics
- Short Introduction to CSS
- Working with Jupyter Notebooks
When Scraping Is Not Required
- Introduction to Pandas IO
- Downloading Different File Formats: CSV, JSON
- Simple Data Conversions
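As a rough illustration of this module, the sketch below reads CSV and JSON with Pandas and applies one simple conversion. The sample data, column names, and variable names are invented for the example; in practice `read_csv` and `read_json` would be given a file path or URL instead of an in-memory string.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for downloaded files.
CSV_TEXT = "item,price\nwidget,$4.50\ngadget,$12.00\n"
JSON_TEXT = '[{"item": "widget", "price": 4.5}, {"item": "gadget", "price": 12.0}]'

# Both readers accept file-like objects as well as paths and URLs.
csv_frame = pd.read_csv(io.StringIO(CSV_TEXT))
json_frame = pd.read_json(io.StringIO(JSON_TEXT))

# Simple conversion: strip the currency symbol and turn the text into floats.
csv_frame["price"] = (
    csv_frame["price"].str.replace("$", "", regex=False).astype(float)
)

print(csv_frame)
print(json_frame)
```

After the conversion, both frames hold the same numeric prices, so they can be compared or combined directly.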
Hands-On: Displaying a Web Table
- Accessing a Web Page
- Details on Chrome’s Developer Tools
- Scraping with Chrome
Extracting Data
- Using Module Requests
- Using Module Beautiful Soup
Necessary Data Manipulation
- Creating Lists from Beautiful Soup
- Creating DataFrames from Beautiful Soup
- Creating .csv Files from DataFrames
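The three steps in this module (lists, DataFrames, .csv output) can be sketched as follows. This is a minimal example, not the course's own procedure: the HTML snippet, the table id `prices`, and all variable names are made up, and it assumes the `beautifulsoup4` and `pandas` packages are installed. A real page would first be fetched with Requests.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical page fragment; a real page would come from an HTTP response.
HTML = """
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>4.50</td></tr>
  <tr><td>Gadget</td><td>12.00</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", id="prices")

# Step 1: build a list of rows, each row a list of cell texts.
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# Step 2: the first row is the header, the rest is data.
header, *body = rows
frame = pd.DataFrame(body, columns=header)
frame["Price"] = frame["Price"].astype(float)

# Step 3: to_csv returns the CSV text when no path is given;
# pass a filename such as "prices.csv" to write a file instead.
csv_text = frame.to_csv(index=False)
print(csv_text)
```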
Hands-On: Extracting Tables from Websites
- Complete Project From
- Looking at Website
- Extracting Data
- Creating .csv Files
Using Selenium
- Client-Side Scripting
- What Selenium Does
- A Step Down Process
- Demonstration: Extracting Data
Hands-On Project: Extracting Table Using Selenium
- Creating Procedure for Extracting Table
- Using and Documenting Procedure for Extracting Table
Extracting Data From Log Files
- Review of Regular Expressions
- Handling Date-Time Fields
- Demonstration: Extraction from Log File
- Hands-On: Extraction from Log File
- Using Pandas to Extract Data
- Demonstration: Extraction from Log File
- Hands-On: Extraction from Log File
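To give a flavor of the regular-expression and date-time topics above, here is a small sketch that parses web-server style log lines. The log format, the sample lines, and the names `LINE_PATTERN` and `parse_line` are assumptions made for the example; real log files vary and usually need their own pattern.

```python
import re
from datetime import datetime

# Hypothetical sample lines in a common web-server access-log layout.
LOG_LINES = [
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.11 - - [10/Oct/2023:13:56:01 +0000] "POST /login HTTP/1.1" 302 0',
    "malformed line that should be skipped",
]

# Named groups keep the extracted fields self-describing.
LINE_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # %b expects English month abbreviations under the default C locale.
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record


records = [r for r in map(parse_line, LOG_LINES) if r is not None]
for record in records:
    print(record["timestamp"].isoformat(), record["method"], record["path"])
```

A list of dicts like `records` can be passed straight to `pandas.DataFrame(records)` when the Pandas route is preferred.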
Extracting Table from a Character-based PDF
- Overview of Current Methods
- Demonstration of Extracting Data from Character-based PDF
- Hands-On: Procedure for Extracting Table
Extracting Table from a Non Character-based PDF
- Overview of Current Methods
- Demonstration of Extracting Data from Non Character-based PDF
- Hands-On: Procedure for Extracting Table
Making a Standalone Program
- Command Line Parameters
- Converting Jupyter Notebook to a Python Script
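A notebook can be exported to a script with `jupyter nbconvert --to script`, after which command line parameters replace the values that were hard-coded in notebook cells. The sketch below shows one possible shape using the standard `argparse` module; the parameter names (`url`, `--output`, `--table-index`) and the functions `build_parser` and `main` are hypothetical, and the scraping itself is left as a placeholder.

```python
import argparse


def build_parser():
    """Describe the command line parameters of the hypothetical scraper."""
    parser = argparse.ArgumentParser(
        description="Scrape a table from a web page and save it as CSV."
    )
    parser.add_argument("url", help="address of the page to scrape")
    parser.add_argument(
        "-o", "--output", default="table.csv", help="name of the CSV file to write"
    )
    parser.add_argument(
        "--table-index",
        type=int,
        default=0,
        help="which table on the page to extract (0 is the first)",
    )
    return parser


def main(argv=None):
    """Parse parameters; argv=None makes argparse read sys.argv."""
    args = build_parser().parse_args(argv)
    # Placeholder: the scraping steps from the notebook would go here.
    print(f"Would scrape table {args.table_index} from {args.url} into {args.output}")
    return args


# Demonstration with an explicit argument list instead of the real command line.
demo_args = main(["https://example.com/data", "--table-index", "2"])
```

In a finished script, the demonstration line would typically be replaced by a call to `main()` under the usual `if __name__ == "__main__":` guard so that the parameters come from the command line.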
Please Contact Your ROI Representative to Discuss Course Tailoring!