In this tutorial, we’ll build a multithreaded web scraper in Python that uses Redis to cache responses and avoid redundant HTTP requests. Each thread processes its own group of URLs, and all threads share the same Redis cache, so a URL fetched by one thread does not need to be fetched again by another.
Database Setup
Create a Redis database using the Upstash Console or Upstash CLI, and add UPSTASH_REDIS_REST_URL and UPSTASH_REDIS_REST_TOKEN to your .env file:
UPSTASH_REDIS_REST_URL=your_upstash_redis_url
UPSTASH_REDIS_REST_TOKEN=your_upstash_redis_token
The script loads this file with python-dotenv at startup, and Redis.from_env() then reads both variables from the environment.
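As a quick sanity check before running the scraper, the sketch below reports which of the two variables are missing. The helper name `missing_upstash_vars` is hypothetical and not part of any library; it only inspects a mapping such as `os.environ`.

```python
import os

# The two variables named in the .env file above
REQUIRED_VARS = ("UPSTASH_REDIS_REST_URL", "UPSTASH_REDIS_REST_TOKEN")


def missing_upstash_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_upstash_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Upstash credentials found.")
```

In the scraper itself, a check like this would run after `load_dotenv()` and before the Redis client is created.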
Installation
First, install the required libraries. The threading module is part of Python’s standard library, so it does not need to be installed:
pip install requests upstash-redis python-dotenv
Code Explanation
We’ll create a multithreaded web scraper that performs HTTP requests on a set of grouped URLs. Each thread first checks whether the response for a URL is already cached in Redis. If it is, the thread uses the cached response; otherwise, it performs a fresh HTTP request and stores the result in Redis for future requests.
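The check-then-fetch flow described above can be sketched without any network access. In this minimal sketch, `DictCache`, `get_or_fetch`, and `fake_fetch` are hypothetical names: `DictCache` is an in-memory stand-in for the Redis client, and `fake_fetch` stands in for the HTTP request.

```python
class DictCache:
    """In-memory stand-in exposing the get/set calls the scraper relies on."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)  # None on a cache miss

    def set(self, key, value):
        self._data[key] = value


def get_or_fetch(cache, url, fetch):
    """Return (body, was_cached) for url, calling fetch only on a cache miss."""
    cache_key = f"url:{url}"
    cached = cache.get(cache_key)
    if cached is not None:
        return cached, True
    body = fetch(url)
    cache.set(cache_key, body)
    return body, False


calls = []  # records every simulated HTTP request


def fake_fetch(url):
    calls.append(url)
    return f"body of {url}"


cache = DictCache()
first = get_or_fetch(cache, "https://example.com/a", fake_fetch)
second = get_or_fetch(cache, "https://example.com/a", fake_fetch)
print(first[1], second[1], len(calls))
```

The second call is served from the cache, so `fake_fetch` runs only once.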
Code
Here’s the complete code:
import threading
import requests
from upstash_redis import Redis
from dotenv import load_dotenv
# Load environment variables from .env file
load_dotenv()
# Initialize Redis client
redis = Redis.from_env()
# Group URLs by thread, with one or two overlapping URLs across groups
urls_to_scrape_groups = [
    [
        'https://httpbin.org/delay/1',
        'https://httpbin.org/delay/4',
        'https://httpbin.org/delay/2',
        'https://httpbin.org/delay/5',
        'https://httpbin.org/delay/3',
    ],
    [
        'https://httpbin.org/delay/5',  # Overlapping URL
        'https://httpbin.org/delay/6',
        'https://httpbin.org/delay/7',
        'https://httpbin.org/delay/2',  # Overlapping URL
        'https://httpbin.org/delay/8',
    ],
    [
        'https://httpbin.org/delay/3',  # Overlapping URL
        'https://httpbin.org/delay/9',
        'https://httpbin.org/delay/10',
        'https://httpbin.org/delay/4',  # Overlapping URL
        'https://httpbin.org/delay/11',
    ],
]
class Scraper(threading.Thread):
    def __init__(self, urls):
        super().__init__()
        self.urls = urls
        self.results = {}

    def run(self):
        for url in self.urls:
            cache_key = f"url:{url}"
            # Attempt to retrieve cached response (None means a cache miss)
            cached_response = redis.get(cache_key)
            if cached_response is not None:
                print(f"[CACHE HIT] {self.name} - URL: {url}")
                self.results[url] = cached_response
                continue  # Skip to the next URL if cache is found
            # If no cache, perform the HTTP request
            print(f"[FETCHING] {self.name} - URL: {url}")
            try:
                # Time out instead of blocking the thread indefinitely
                response = requests.get(url, timeout=30)
            except requests.RequestException as exc:
                print(f"[ERROR] {self.name} - Request for {url} failed: {exc}")
                self.results[url] = None
                continue
            if response.status_code == 200:
                self.results[url] = response.text
                # Store the response in Redis cache
                redis.set(cache_key, response.text)
            else:
                print(f"[ERROR] {self.name} - Failed to retrieve {url}")
                self.results[url] = None
def main():
    threads = []
    for urls in urls_to_scrape_groups:
        scraper = Scraper(urls)
        threads.append(scraper)
        scraper.start()
    # Wait for all threads to complete
    for scraper in threads:
        scraper.join()
    print("\nScraping results:")
    for scraper in threads:
        for url, result in scraper.results.items():
            print(f"Thread {scraper.name} - URL: {url} - Response Length: {len(result) if result else 'Failed'}")

if __name__ == "__main__":
    main()
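The listing above stores responses with no expiry, so cached pages stay in Redis until they are deleted. Redis supports a time-to-live on SET, and to the best of my knowledge upstash-redis exposes it as an `ex` keyword argument (seconds); treat that as an assumption to verify against the client documentation. The sketch below uses a hypothetical in-memory stand-in, `ExpiringDictCache`, so the expiry behavior can be tried without a database; `cache_response` and the one-hour default are also illustrative choices.

```python
import time


class ExpiringDictCache:
    """In-memory stand-in that mimics set(key, value, ex=seconds)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def set(self, key, value, ex=None):
        expires_at = None if ex is None else self._clock() + ex
        self._data[key] = (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # entry has expired, treat as a miss
            return None
        return value


def cache_response(client, url, body, ttl_seconds=3600):
    """Store body under the scraper's key format with a time-to-live."""
    client.set(f"url:{url}", body, ex=ttl_seconds)


now = [0.0]  # fake clock value, advanced manually below
demo_cache = ExpiringDictCache(clock=lambda: now[0])
cache_response(demo_cache, "https://example.com/a", "body", ttl_seconds=10)
before_expiry = demo_cache.get("url:https://example.com/a")
now[0] = 10.0
after_expiry = demo_cache.get("url:https://example.com/a")
print(before_expiry, after_expiry)
```

With a TTL in place, a page that changes over time is re-fetched once its cached copy expires instead of being served from the cache indefinitely.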
Explanation
- Threaded Scraper Class: The Scraper class is a subclass of threading.Thread. Each thread takes a list of URLs and iterates over them to retrieve or fetch their responses.
- Redis Caching:
  - Before making an HTTP request, the scraper checks if the response is already in the Redis cache.
  - If a cached response is found, it uses that response instead of making a new request, marked with [CACHE HIT] in the logs.
  - If no cached response exists, it fetches the content from the URL, caches the result in Redis, and proceeds.
- Overlapping URLs:
  - Some URLs are intentionally included in multiple groups to demonstrate the cache functionality across threads. Once a URL’s response is cached by one thread, another thread retrieving the same URL will pull it from the cache instead of re-fetching.
- Main Function:
  - The main function initiates and starts multiple Scraper threads, each handling a group of URLs.
  - It waits for all threads to complete before printing the results.
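One caveat with overlapping URLs: if two threads reach the same URL before either has stored its response, both see a cache miss and both fetch it. A possible way to avoid that within a single process is a lock per URL, sketched below. All names here (`lock_for`, `get_or_fetch_once`, `slow_fetch`) are hypothetical, and a plain dict stands in for Redis. This does not coordinate separate processes or machines.

```python
import threading
import time

cache = {}                        # stand-in for the shared Redis cache
url_locks = {}                    # one lock per URL
registry_guard = threading.Lock() # protects url_locks itself


def lock_for(url):
    """Return the lock dedicated to url, creating it on first use."""
    with registry_guard:
        return url_locks.setdefault(url, threading.Lock())


def get_or_fetch_once(url, fetch):
    """Fetch url at most once, even when several threads ask concurrently."""
    with lock_for(url):
        if url in cache:
            return cache[url]
        body = fetch(url)
        cache[url] = body
        return body


fetch_calls = []  # records every simulated HTTP request


def slow_fetch(url):
    fetch_calls.append(url)
    time.sleep(0.05)  # simulate network latency
    return f"body of {url}"


workers = [
    threading.Thread(target=get_or_fetch_once, args=("https://example.com/a", slow_fetch))
    for _ in range(5)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(len(fetch_calls))
```

The trade-off is that threads waiting on the same URL block until the first fetch finishes, while threads working on different URLs are unaffected.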
Running the Code
Once everything is set up, run the script using:
python your_script_name.py
Sample Output
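You will see output similar to this. The exact order, and which URLs are reported as cache hits, depends on thread timing and on whether the cache already holds responses from an earlier run: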
[FETCHING] Thread-1 - URL: https://httpbin.org/delay/1
[FETCHING] Thread-1 - URL: https://httpbin.org/delay/4
[CACHE HIT] Thread-2 - URL: https://httpbin.org/delay/5
[FETCHING] Thread-3 - URL: https://httpbin.org/delay/3
...
Benefits of Using Redis Cache
Using Redis as a cache avoids duplicate requests for URLs that appear in more than one group, and a cache lookup is typically much faster than repeating a slow HTTP request. Because the cache lives in Redis rather than in the script’s memory, responses cached in one run are also available to later runs.