Scraping and Sourcing Data with Python
Class Code | Date | Delivery Method | Cost |
---|---|---|---|
ITS-2598 | | Live Online - 2 sessions | $450 |
Before each live online session, Tech Training will provide a Zoom link along with any required class materials.
The ability to locate and acquire important data is a valuable skill for data analysis and data science.
In this class, we will:
- Explore many sources and repositories for acquiring valuable data, such as open government and university datasets
- Explore popular social APIs (e.g., Facebook, Spotify, Twitter) and domain-specific APIs (e.g., healthcare, news, science and math) that store a wealth of data
- Discuss methods to query web servers, and request and parse data to extract the information you need
- Explore how to scrape various types of data from websites, how to read and extract text from documents (e.g., PDF, Word), and methods to clean and store sourced and scraped data
Learning Objectives
During this course, you will have the opportunity to:
- Explore a Variety of Public Data Repositories
- Understand Effective Means to Search for Valuable Data
- Use the Python Programming Language to Source and Scrape Data
- Use Popular Social and Domain-specific APIs to Access Data (e.g., Slack)
- Extract Text from Documents (e.g., PDF, Word) and Access PDF Tables
- Scrape Data from Web Pages
- Clean Scraped Data and Store Sourced and Scraped Data
Topic Outline
Overview of Data Sourcing
- Public Open Datasets
- Government Data
- University Data
- Milestone 1 Learning Exercise: Explore public data repositories
Introduction to the Python Programming Language
- Installing Anaconda
- Milestone 2 Learning Exercise: Learn how to use Jupyter Notebooks
Using Public APIs (Application Programming Interfaces)
- Explore Popular and Domain-specific APIs
- Common Conventions
- Parsing JSON
- Milestone 3 Learning Exercise: Access a public API (e.g., Facebook, Twitter, Google)
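As a rough illustration of the "Parsing JSON" topic above, the sketch below decodes a JSON response body and pulls out selected fields. The payload, its field names (`results`, `title`, `tags`), and the helper `extract_titles` are hypothetical and stand in for whatever a real API returns; no network request is made.

```python
import json

# Hypothetical response body; a real API would return text like this over HTTP.
SAMPLE_RESPONSE = """
{
  "results": [
    {"id": 1, "title": "Air Quality 2023", "tags": ["environment", "health"]},
    {"id": 2, "title": "Transit Ridership", "tags": ["transport"]},
    {"id": 3, "title": "Untitled", "tags": []}
  ],
  "next_page": null
}
"""


def extract_titles(raw_json, required_tag=None):
    """Return dataset titles, optionally keeping only those with a given tag."""
    payload = json.loads(raw_json)  # JSON text -> Python dicts and lists
    titles = []
    for item in payload.get("results", []):
        tags = item.get("tags", [])
        if required_tag is None or required_tag in tags:
            titles.append(item["title"])
    return titles


all_titles = extract_titles(SAMPLE_RESPONSE)
health_titles = extract_titles(SAMPLE_RESPONSE, required_tag="health")
print(all_titles)
print(health_titles)
```

Using `.get()` with a default for optional keys is one common way to tolerate records that omit a field, which tends to happen with real API responses.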
Extracting Text from Documents
- Milestone 4 Learning Exercise: Extract data from PDFs
Overview of Data Scraping
- Introduction to BeautifulSoup
- Parsing HTML and JavaScript
- Milestone 5 Learning Exercise: Scrape data from a website
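To give a sense of what parsing HTML involves, here is a minimal sketch that collects links from a page. It deliberately uses Python's standard-library `html.parser` rather than BeautifulSoup so that it runs without extra installs; with BeautifulSoup, a call such as `soup.find_all("a", href=True)` would serve a similar purpose. The sample HTML and the class name `LinkCollector` are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<html><body>
  <h1>Open Datasets</h1>
  <ul>
    <li><a href="/data/air.csv">Air quality</a></li>
    <li><a href="/data/transit.csv">Transit ridership</a></li>
    <li><a>No link target</a></li>
  </ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags that have an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the anchor currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
links = collector.links
print(links)
```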
Cleaning Scraped Data
- Storing Sourced and Scraped Data
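The cleaning and storing steps might look something like the sketch below, which normalizes whitespace, converts price strings to numbers, drops unusable and duplicate rows, and writes the result as CSV. The sample rows, the column names (`name`, `price`), and the helper functions are assumptions for illustration only; the CSV is written to an in-memory buffer rather than a file.

```python
import csv
import io

# Hypothetical scraped rows with typical problems: stray whitespace,
# currency formatting, a duplicate, and a row with no usable values.
RAW_ROWS = [
    {"name": "  Air Quality ", "price": "$1,200.50"},
    {"name": "Transit\n Ridership", "price": "300"},
    {"name": "air quality", "price": "$1,200.50"},
    {"name": "", "price": "n/a"},
]


def clean_row(row):
    """Normalize whitespace in the name and convert the price to a float.

    Returns None when the row cannot be used.
    """
    name = " ".join(row["name"].split())  # collapse runs of whitespace
    if not name:
        return None
    try:
        price = float(row["price"].replace("$", "").replace(",", ""))
    except ValueError:
        return None
    return {"name": name, "price": price}


def clean_rows(rows):
    """Clean all rows, dropping unusable rows and case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        result = clean_row(row)
        if result is None:
            continue
        key = result["name"].lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(result)
    return cleaned


def to_csv(rows):
    """Serialize cleaned rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


cleaned = clean_rows(RAW_ROWS)
csv_text = to_csv(cleaned)
print(csv_text)
```

To store the data on disk instead, the same `csv.DictWriter` can be pointed at a file opened with `open(path, "w", newline="")`.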
Conclusion: Next steps
Prerequisites
Learners should have an understanding of basic Python programming.
Custom training workshops are available for this program.
Technology training sessions can be structured around individual or group learning objectives.
University IT Technology Training sessions are available to a wide range of participants, including Stanford University staff, faculty, students, and employees of Stanford Hospitals & Clinics, such as Stanford Health Care, Stanford Health Care Tri-Valley, Stanford Medicine Partners, and Stanford Medicine Children's Health.
Additionally, some of these programs are open to interested individuals not affiliated with Stanford, allowing for broader community engagement and learning opportunities.