How to Scrape Login-Protected Websites with Selenium (Step-by-Step Guide)
Web scraping protected content requires a different approach than scraping public pages. Here’s my step-by-step guide to accessing and scraping data from password-protected websites using Selenium.
DataFuel integrates this exact functionality in its API at datafuel.dev. If you just want to get it done fast, use that; otherwise, read on.
My Steps to Scrape a Password-Protected Website:
- Capture the HTML form elements: username ID, password ID, and login button class
- Use a tool like requests or Selenium to automate the login: fill username, wait, fill password, wait, click login
- Store session cookies for authentication
- Continue scraping the authenticated pages
Let’s use this example: let’s say I want to scrape my own API key from my account at datafuel.dev. It is located at https://app.datafuel.dev/account/api_key.
1. The Login Page
First, you need to find the login page. Most websites will respond with a 3xx redirect (often a 303) if you try to access a page behind a login, so if you try to scrape https://app.datafuel.dev/account/api_key directly, you will automatically land on the login page at https://app.datafuel.dev/login. This is a good way to automate finding the login page if it is not provided already.
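For example, here is a minimal sketch of that idea (an assumption-laden example: it uses the requests library to follow the redirect chain and simply reports where an unauthenticated request ends up):
import requests

def find_login_url(protected_url: str) -> str:
    """Follow redirects from a protected page to discover where the login page lives."""
    response = requests.get(protected_url, allow_redirects=True)
    # After redirects are followed, response.url is the page we actually landed on,
    # which is typically the login page when we are not authenticated.
    return response.url

login_url = find_login_url("https://app.datafuel.dev/account/api_key")
print(login_url)  # expected to print https://app.datafuel.dev/login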
Here’s a script to find the login form elements:
from bs4 import BeautifulSoup

def extract_login_form(html_content: str):
    """
    Extracts the login form elements from the given HTML content and returns their CSS selectors.
    """
    soup = BeautifulSoup(html_content, "html.parser")

    # Finding the username/email field
    username_email = (
        soup.find("input", {"type": "email"})
        or soup.find("input", {"name": "username"})
        or soup.find("input", {"type": "text"})
    )  # Fallback to input type text if no email type is found

    # Finding the password field
    password = soup.find("input", {"type": "password"})

    # Finding the login button
    # Searching for buttons/input of type submit closest to the password or username field
    login_button = None

    # First try to find a submit button within the same form
    if password:
        form = password.find_parent("form")
        if form:
            login_button = form.find("button", {"type": "submit"}) or form.find(
                "input", {"type": "submit"}
            )

    # If no button is found in the form, fall back to finding any submit button
    if not login_button:
        login_button = soup.find("button", {"type": "submit"}) or soup.find(
            "input", {"type": "submit"}
        )

    # Extracting CSS selectors
    def generate_css_selector(element, element_type):
        if "id" in element.attrs:
            return f"#{element['id']}"
        elif "type" in element.attrs:
            return f"{element_type}[type='{element['type']}']"
        else:
            return element_type

    # Generate CSS selectors with the updated logic
    username_email_css_selector = None
    if username_email:
        username_email_css_selector = generate_css_selector(username_email, "input")

    password_css_selector = None
    if password:
        password_css_selector = generate_css_selector(password, "input")

    login_button_css_selector = None
    if login_button:
        login_button_css_selector = generate_css_selector(
            login_button, "button" if login_button.name == "button" else "input"
        )

    return username_email_css_selector, password_css_selector, login_button_css_selector

def main(html_content: str):
    # Call the extract_login_form function and return its result
    return extract_login_form(html_content)
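A quick usage sketch (assuming we fetch the login page found in step 1 with requests; the exact selectors returned depend on the page’s markup):
import requests

html = requests.get("https://app.datafuel.dev/login").text
username_selector, password_selector, button_selector = main(html)
print(username_selector, password_selector, button_selector)
# e.g. #email  #password  button[type='submit']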
2. Using Selenium to Actually Log In
Now you need to create a Selenium WebDriver. We will run it with headless Chrome in Python. This is how to install it:
# Install selenium and chromium (notebook-style shell commands, e.g. in Google Colab)
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
Then actually log into our website and save the cookies:
# Imports
from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
import time

# Set up Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Initialize the WebDriver
driver = webdriver.Chrome(options=chrome_options)

# Open the login page
driver.get("https://app.datafuel.dev/login")

# Find the email input field by ID and input your email
email_input = driver.find_element(By.ID, "email")
email_input.send_keys("******@gmail.com")

# Find the password input field by ID and input your password
password_input = driver.find_element(By.ID, "password")
password_input.send_keys("*******")

# Find the login button and submit the form
login_button = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")
login_button.click()

# Wait for the login process to complete
time.sleep(5)  # Adjust this depending on your site's response time

# Note: keep the driver open for now; we quit it in step 3, after saving the cookies
⚠️ Security Warning: Never hardcode credentials in your code. Instead, use environment variables or a secure configuration management system to store sensitive information:
import os
email = os.getenv('LOGIN_EMAIL')
password = os.getenv('LOGIN_PASSWORD')
# Use these variables in your login code
email_input.send_keys(email)
password_input.send_keys(password)
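💡 Pro tip: instead of a fixed time.sleep(5), you can use an explicit wait so the script moves on as soon as the login actually completes. A minimal sketch with Selenium’s WebDriverWait (assuming a successful login navigates away from the /login URL):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the browser to leave the login page,
# which we take as a sign that the login succeeded
WebDriverWait(driver, 10).until(EC.url_changes("https://app.datafuel.dev/login"))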
3. Store Cookies
It is as simple as saving them into a dictionary using the driver.get_cookies() method:
def save_cookies(driver):
    """Save cookies from the Selenium WebDriver into a dictionary."""
    cookies = driver.get_cookies()
    cookie_dict = {}
    for cookie in cookies:
        cookie_dict[cookie['name']] = cookie['value']
    return cookie_dict

# Save the cookies from the WebDriver (before quitting the browser)
cookies = save_cookies(driver)

# Now that the cookies are saved, the browser can be closed
driver.quit()
4. Get Data from Our Logged-in Session
In this part, we will use the requests library, but you could keep using Selenium too:
def scrape_api_key(cookies):
    """Use cookies to scrape the /account/api_key page."""
    url = 'https://app.datafuel.dev/account/api_key'

    # Set up the session to persist cookies
    session = requests.Session()

    # Add cookies from Selenium to the requests session
    for name, value in cookies.items():
        session.cookies.set(name, value)

    # Make the request to the /account/api_key page
    response = session.get(url)

    # Check if the request is successful
    if response.status_code == 200:
        print("API Key page content:")
        print(response.text)  # Print the page content (could contain the API key)
    else:
        print(f"Failed to retrieve API key page, status code: {response.status_code}")
5. Bonus: Using AI to Extract the API Key
Now let’s say we want to extract the API key from the response text. We can use AI to do that:
from openai import OpenAI

# Requires the OPENAI_API_KEY environment variable to be set
client = OpenAI()

def extract_api_key_using_ai(response_text):
    """Use OpenAI's GPT model to extract the API key."""
    prompt = f"""
    You are an expert scraper, and you will extract only the information asked from the context.
    I need the value of my api-key from the following context:
    {response_text}
    """
    try:
        # Use OpenAI client to create a chat completion
        chat_completion = client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            model="gpt-3.5-turbo",  # You can change to gpt-4 if needed
        )
        # Extract the response from the AI
        extracted_api_key = chat_completion.choices[0].message.content
        return extracted_api_key
    except Exception as e:
        print(f"An error occurred with OpenAI API: {e}")
        return None
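A short usage sketch chaining it with the hypothetical fetch_api_key_page helper from step 4:
page_html = fetch_api_key_page(cookies)
if page_html:
    api_key = extract_api_key_using_ai(page_html)
    print(api_key)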
This approach is particularly useful when:
- The API key format is inconsistent
- The page structure changes frequently
- The key is embedded in complex JavaScript code
- You need to handle multiple possible formats
💡 Pro tip: Remember to handle the OpenAI API costs and rate limits appropriately in production environments.
If you found this guide helpful, you might also be interested in Web Scraping: The Art of Automated Data Collection. It is a beginner’s guide to getting started with web scraping, with best practices.