How to Scrape Login-Protected Websites with Selenium (Step-by-Step Guide)
Web scraping protected content requires a different approach than scraping public pages. Here’s my step-by-step guide to accessing and scraping data from password-protected websites using Selenium.
DataFuel integrates this exact functionality in its API at datafuel.dev. If you just want to get it done fast, use that; otherwise, read on.
My Steps to Scrape a Password-Protected Website:
- Capture the HTML form elements: username ID, password ID, and login button class
- Use a tool like requests or Selenium to automate the login: fill username, wait, fill password, wait, click login
- Store session cookies for authentication
- Continue scraping the authenticated pages
Let’s use this example: let’s say I want to scrape my own API key from my account at datafuel.dev. It is located at https://app.datafuel.dev/account/api_key.
1. The Login Page
First, you need to find the login page. Most websites will respond with a 3xx redirect (often a 303) if you try to access a page behind a login, so if you try to scrape https://app.datafuel.dev/account/api_key directly, you will automatically land on the login page at https://app.datafuel.dev/login. This is a good way to automate finding the login page if it is not provided already.
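For example, here is a minimal sketch of that idea (an assumption-laden example: it uses the requests library to follow the redirect chain and simply reports where an unauthenticated request ends up):
import requests

def find_login_url(protected_url: str) -> str:
    """Follow redirects from a protected page to discover where the login page lives."""
    response = requests.get(protected_url, allow_redirects=True)
    # After redirects are followed, response.url is the page we actually landed on,
    # which is typically the login page when we are not authenticated.
    return response.url

login_url = find_login_url("https://app.datafuel.dev/account/api_key")
print(login_url)  # expected to print https://app.datafuel.dev/login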
Here’s a script to find the login form elements:
from bs4 import BeautifulSoup

def extract_login_form(html_content: str):
    """
    Extracts the login form elements from the given HTML content and returns their CSS selectors.
    """
    soup = BeautifulSoup(html_content, "html.parser")

    # Finding the username/email field
    username_email = (
        soup.find("input", {"type": "email"})
        or soup.find("input", {"name": "username"})
        or soup.find("input", {"type": "text"})
    )  # Fallback to input type text if no email type is found

    # Finding the password field
    password = soup.find("input", {"type": "password"})

    # Finding the login button
    # Searching for buttons/input of type submit closest to the password or username field
    login_button = None

    # First try to find a submit button within the same form
    if password:
        form = password.find_parent("form")
        if form:
            login_button = form.find("button", {"type": "submit"}) or form.find(
                "input", {"type": "submit"}
            )

    # If no button is found in the form, fall back to finding any submit button
    if not login_button:
        login_button = soup.find("button", {"type": "submit"}) or soup.find(
            "input", {"type": "submit"}
        )

    # Extracting CSS selectors
    def generate_css_selector(element, element_type):
        if "id" in element.attrs:
            return f"#{element['id']}"
        elif "type" in element.attrs:
            return f"{element_type}[type='{element['type']}']"
        else:
            return element_type

    # Generate CSS selectors with the updated logic
    username_email_css_selector = None
    if username_email:
        username_email_css_selector = generate_css_selector(username_email, "input")

    password_css_selector = None
    if password:
        password_css_selector = generate_css_selector(password, "input")

    login_button_css_selector = None
    if login_button:
        login_button_css_selector = generate_css_selector(
            login_button, "button" if login_button.name == "button" else "input"
        )

    return username_email_css_selector, password_css_selector, login_button_css_selector

def main(html_content: str):
    # Call the extract_login_form function and return its result
    return extract_login_form(html_content)
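A quick usage sketch (assuming we fetch the login page found in step 1 with requests; the exact selectors returned depend on the page’s markup):
import requests

html = requests.get("https://app.datafuel.dev/login").text
username_selector, password_selector, button_selector = main(html)
print(username_selector, password_selector, button_selector)
# e.g. #email  #password  button[type='submit']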
2. Using Selenium to Actually Log In
Now you need to create a Selenium WebDriver. We will run it with headless Chrome in Python. This is how to install it:
# Install selenium and chromium (notebook-style shell commands, e.g. in Google Colab)
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
Then actually log into our website and save the cookies:
# Imports
from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
import time

# Set up Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Initialize the WebDriver
driver = webdriver.Chrome(options=chrome_options)

# Open the login page
driver.get("https://app.datafuel.dev/login")

# Find the email input field by ID and input your email
email_input = driver.find_element(By.ID, "email")
email_input.send_keys("******@gmail.com")

# Find the password input field by ID and input your password
password_input = driver.find_element(By.ID, "password")
password_input.send_keys("*******")

# Find the login button and submit the form
login_button = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")
login_button.click()

# Wait for the login process to complete
time.sleep(5)  # Adjust this depending on your site's response time

# Note: keep the driver open for now; we quit it in step 3, after saving the cookies
⚠️ Security Warning: Never hardcode credentials in your code. Instead, use environment variables or a secure configuration management system to store sensitive information:
import os
email = os.getenv('LOGIN_EMAIL')
password = os.getenv('LOGIN_PASSWORD')
# Use these variables in your login code
email_input.send_keys(email)
password_input.send_keys(password)
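💡 Pro tip: instead of a fixed time.sleep(5), you can use an explicit wait so the script moves on as soon as the login actually completes. A minimal sketch with Selenium’s WebDriverWait (assuming a successful login navigates away from the /login URL):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the browser to leave the login page,
# which we take as a sign that the login succeeded
WebDriverWait(driver, 10).until(EC.url_changes("https://app.datafuel.dev/login"))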
3. Store Cookies
It is as simple as saving them into a dictionary using the driver.get_cookies() method:
def save_cookies(driver):
    """Save cookies from the Selenium WebDriver into a dictionary."""
    cookies = driver.get_cookies()
    cookie_dict = {}
    for cookie in cookies:
        cookie_dict[cookie['name']] = cookie['value']
    return cookie_dict

# Save the cookies from the WebDriver (before quitting the browser)
cookies = save_cookies(driver)

# Now that the cookies are saved, the browser can be closed
driver.quit()
4. Get Data from Our Logged-in Session
In this part, we will use the requests library, but you could keep using Selenium too:
def scrape_api_key(cookies):
    """Use cookies to scrape the /account/api_key page."""
    url = 'https://app.datafuel.dev/account/api_key'

    # Set up the session to persist cookies
    session = requests.Session()

    # Add cookies from Selenium to the requests session
    for name, value in cookies.items():
        session.cookies.set(name, value)

    # Make the request to the /account/api_key page
    response = session.get(url)

    # Check if the request is successful
    if response.status_code == 200:
        print("API Key page content:")
        print(response.text)  # Print the page content (could contain the API key)
    else:
        print(f"Failed to retrieve API key page, status code: {response.status_code}")
5. Bonus: Using AI to Extract the API Key
Now let’s say we want to extract the API key from the response text. We can use AI to do that:
from openai import OpenAI

# Requires the OPENAI_API_KEY environment variable to be set
client = OpenAI()

def extract_api_key_using_ai(response_text):
    """Use OpenAI's GPT model to extract the API key."""
    prompt = f"""
    You are an expert scraper, and you will extract only the information asked from the context.
    I need the value of my api-key from the following context:
    {response_text}
    """
    try:
        # Use OpenAI client to create a chat completion
        chat_completion = client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            model="gpt-3.5-turbo",  # You can change to gpt-4 if needed
        )
        # Extract the response from the AI
        extracted_api_key = chat_completion.choices[0].message.content
        return extracted_api_key
    except Exception as e:
        print(f"An error occurred with OpenAI API: {e}")
        return None
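A short usage sketch chaining it with the hypothetical fetch_api_key_page helper from step 4:
page_html = fetch_api_key_page(cookies)
if page_html:
    api_key = extract_api_key_using_ai(page_html)
    print(api_key)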
This approach is particularly useful when:
- The API key format is inconsistent
- The page structure changes frequently
- The key is embedded in complex JavaScript code
- You need to handle multiple possible formats
💡 Pro tip: Remember to handle the OpenAI API costs and rate limits appropriately in production environments.
If you found this guide helpful, you might also be interested in Web Scraping: The Art of Automated Data Collection. It is a beginner’s guide to getting started with web scraping, with best practices.