
Web Scraping Basics

Web scraping is the process of automatically extracting data from websites using Python libraries like requests and BeautifulSoup.

What is Web Scraping?

Web scraping means fetching a web page's HTML and extracting specific data from it programmatically, instead of copying it by hand. Common uses include price tracking, research data collection, and content aggregation.

The Two-Step Process

1) Fetch the page's HTML using requests.get(url).
2) Parse the HTML using a parser library such as BeautifulSoup (from bs4) to navigate and search the document structure.

Finding Elements

BeautifulSoup lets you search by tag name, CSS class, id, or attributes using methods like find(), find_all(), and CSS selectors via select().
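As a minimal sketch of these search methods, the example below parses a small inline HTML string rather than a live page. The markup, class names, and the data-role attribute are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
SAMPLE_HTML = """
<div id="main">
  <h1 class="title">Shop</h1>
  <p class="item">Apple</p>
  <p class="item">Banana</p>
  <a href="/cart" data-role="nav">Cart</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first match, or None when nothing matches.
heading = soup.find("h1")
missing = soup.find("table")

# find_all() returns every match; class_ filters by CSS class.
items = soup.find_all("p", class_="item")

# Search by id, or by an arbitrary attribute via attrs.
main = soup.find(id="main")
nav_link = soup.find("a", attrs={"data-role": "nav"})

# select() takes a CSS selector and returns a list of matches;
# select_one() returns only the first match (or None).
selected = soup.select("div#main p.item")
first_item = soup.select_one("p.item")

print(heading.get_text())
print([p.get_text() for p in items])
print(nav_link["href"])
```

Because find() and select_one() can return None, it is usually worth checking the result before calling methods on it.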

Extracting Text and Attributes

Once you have an element, .text or .get_text() returns its text content. Read attribute values such as links with element["href"], which raises KeyError if the attribute is missing, or with element.get("href"), which returns None instead.
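The following sketch shows text and attribute extraction on a small invented snippet, including what happens when an attribute is absent. Passing a separator to get_text() is a choice made here so that text from nested tags stays separated by spaces.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, invented for this illustration.
html = '<p>  Read the <a href="/docs" title="Docs">manual</a> first. </p><a>no link</a>'
soup = BeautifulSoup(html, "html.parser")

paragraph = soup.find("p")
links = soup.find_all("a")

# .text keeps the original whitespace and includes text from nested tags.
raw_text = paragraph.text

# get_text(separator, strip=True) strips each text fragment and joins
# the fragments with the separator.
clean_text = paragraph.get_text(" ", strip=True)

# Square brackets read an attribute that is known to exist.
first_href = links[0]["href"]

# .get() returns None for a missing attribute instead of raising.
second_href = links[1].get("href")

# Square brackets on a missing attribute raise KeyError.
try:
    links[1]["href"]
    raised = False
except KeyError:
    raised = True

print(clean_text)
print(first_href, second_href, raised)
```

When scraping real pages, where some anchors have no href, .get() tends to be the safer default.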

Ethics and Legality

Always check a site's robots.txt and terms of service before scraping. Add delays between requests (time.sleep()) to avoid overloading servers, and identify your scraper with a proper User-Agent header.
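One way to automate the robots.txt check is the standard library's urllib.robotparser module. The sketch below parses an invented robots.txt from a string so it runs without network access. The rules, the URLs, and the helper name filter_allowed are assumptions made for this example.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyScraperBot/1.0"  # illustrative name, as in the Syntax section

# Hypothetical robots.txt content. A real scraper would instead call
# parser.set_url("https://example.com/robots.txt") and parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def filter_allowed(urls):
    """Keep only the URLs that robots.txt permits for USER_AGENT."""
    return [url for url in urls if parser.can_fetch(USER_AGENT, url)]


# Use the site's Crawl-delay if it declares one, else fall back to 1 second.
delay = parser.crawl_delay(USER_AGENT) or 1

candidates = [
    "https://example.com/articles",
    "https://example.com/private/data",
]

for url in filter_allowed(candidates):
    print("OK to fetch:", url)
    # A real request would go here, for example:
    # requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=5)
    time.sleep(delay)  # pause before the next request
```

robots.txt does not cover a site's terms of service, so those still need to be read separately.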

Limitations

Basic scraping with requests + BeautifulSoup only sees the HTML the server sends, so it works for static pages. Content that JavaScript adds after the page loads in a browser requires a browser automation tool such as Selenium or Playwright.
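To illustrate the limitation, the sketch below parses an invented page source in which a script would fill in a product list once a browser runs it. BeautifulSoup does not execute JavaScript, so the list stays empty in the parsed document.

```python
from bs4 import BeautifulSoup

# Hypothetical page source as a server might send it: the product list is
# empty, and a script would populate it only inside a real browser.
RAW_HTML = """
<html><body>
  <h1>Products</h1>
  <ul id="products"></ul>
  <script>
    document.getElementById("products").innerHTML = "<li>Widget</li>";
  </script>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# Static content is available as usual.
static_heading = soup.find("h1").get_text()

# The script is never run, so no <li> elements exist in the parsed tree.
rendered_items = soup.select("#products li")

print(static_heading)
print(len(rendered_items))
```

If a selector that works in the browser's developer tools returns nothing here, JavaScript rendering is a likely cause.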

Syntax

<pre><code>import requests
from bs4 import BeautifulSoup
import time

# Step 1: Fetch the page
url = "https://example.com/articles"
headers = {"User-Agent": "MyScraperBot/1.0"}
response = requests.get(url, headers=headers, timeout=5)
response.raise_for_status()

# Step 2: Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

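# Find a single element (find() returns None when nothing matches)
title = soup.find("h1")
if title is not None:
    print(title.text.strip())

# Find all matching elements
articles = soup.find_all("article", class_="post")
for article in articles:
    heading = article.find("h2")
    link = article.find("a")
    # Skip posts that lack a heading or a link
    if heading is None or link is None:
        continue
    print(heading.text.strip())
    print(link.get("href"))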

# Using CSS selectors
prices = soup.select(".price")
for price in prices:
    print(price.get_text(strip=True))

# Be respectful: pause before requesting the next page
time.sleep(1)
</code></pre>

Revision Notes

• Web scraping = fetch HTML + parse + extract data
• requests.get() fetches the page; BeautifulSoup(html, "html.parser") parses it
• find() returns the first match (or None if nothing matches); find_all() returns a list of all matches
• .text / .get_text() extracts text; element["href"] gets an attribute
• Always check robots.txt and add delays (time.sleep)
• JS-heavy pages need Selenium/Playwright instead of requests

Extract All Links from HTML

Difficulty: Medium

Write a function extract_links(html) that takes a string of HTML and returns a list of all href attribute values found in <a> tags, using BeautifulSoup.

Input:

'<a href="https://a.com">A</a><a href="https://b.com">B</a>'

Output:

['https://a.com', 'https://b.com']

Hint

Use BeautifulSoup(html, "html.parser").find_all("a") to get all anchor tags, then extract link.get("href") for each.

Solution

from bs4 import BeautifulSoup

def extract_links(html):
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a") if a.get("href")]

html = '<a href="https://a.com">A</a><a href="https://b.com">B</a>'
print(extract_links(html))
