Running a Web Scraper
Let’s build a simple web scraper that extracts headlines from The New York Times and uses a RoBERTa-based model from Hugging Face to detect the sentiment of each one.
Define the environment
First, we’ll define our environment. For this project, we’ll need the libraries listed under python_packages:
app.py
import beam

app = beam.App(
    name="web-scraper",
    cpu=4,
    memory="4Gi",
    gpu=0,
    python_version="python3.8",
    # requests is imported directly by scraper.py, so it is listed explicitly
    python_packages=["bs4", "requests", "transformers", "torch"],
)
Starting the environment
Spin up the environment by running:
beam start <your app>.py
You’ll see the red beam text at the end of your shell path, which means you’ve entered the Beam environment!
Write scraping logic
Now, we’ll write the logic to scrape headlines from The New York Times. Create a new file and call it scraper.py.
scraper.py
import requests
from bs4 import BeautifulSoup
from transformers import pipeline

# Build the pipeline once; re-creating it for every headline is slow
model = pipeline(
    "sentiment-analysis", model="siebert/sentiment-roberta-large-english"
)


def predict_sentiment(title):
    # top_k=2 asks for a score for both labels (POSITIVE and NEGATIVE)
    result = model(title, truncation=True, top_k=2)
    # Map each label to its score
    return {i["label"]: i["score"] for i in result}


def scrape_nyt():
    res = requests.get("https://www.nytimes.com", timeout=30)
    res.raise_for_status()  # Stop early on an HTTP error response
    soup = BeautifulSoup(res.content, "html.parser")

    # Grab all headlines (string=True replaces the deprecated text=True)
    headlines = soup.find_all("h3", class_="indicate-hover", string=True)
    total_headlines = len(headlines)
    negative_headlines = 0

    # Iterate through each headline
    for h in headlines:
        title = h.get_text()
        print(title)
        sentiment = predict_sentiment(title)
        print(sentiment)
        # Default to 0.0 so a missing label cannot raise a TypeError
        if sentiment.get("NEGATIVE", 0.0) > sentiment.get("POSITIVE", 0.0):
            negative_headlines += 1

    print(f"{negative_headlines} negative headlines / {total_headlines} total")


if __name__ == "__main__":
    scrape_nyt()
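To check the counting logic without downloading the model or hitting the network, the comparison step can be pulled out into small pure functions. The sketch below is a minimal illustration, not part of the original scraper: the helper names (to_scores, is_negative, count_negative) are hypothetical, and the sample scores are hand-written rather than real model output. It assumes each prediction arrives as a list of label/score dictionaries, which is the shape the scraper's dictionary comprehension expects.

```python
def to_scores(result):
    """Convert pipeline-style output (a list of label/score dicts) to a dict."""
    return {item["label"]: item["score"] for item in result}


def is_negative(scores):
    """Return True when the NEGATIVE score exceeds the POSITIVE score."""
    return scores.get("NEGATIVE", 0.0) > scores.get("POSITIVE", 0.0)


def count_negative(results):
    """Count how many pipeline-style results lean negative."""
    return sum(is_negative(to_scores(r)) for r in results)


# Hand-written sample data (hypothetical scores, not real model output)
sample_results = [
    [{"label": "NEGATIVE", "score": 0.98}, {"label": "POSITIVE", "score": 0.02}],
    [{"label": "POSITIVE", "score": 0.91}, {"label": "NEGATIVE", "score": 0.09}],
    [{"label": "NEGATIVE", "score": 0.60}, {"label": "POSITIVE", "score": 0.40}],
]

negative = count_negative(sample_results)
print(f"{negative} negative headlines / {len(sample_results)} total")
```

Keeping the comparison separate from the scraping and model calls makes it easy to test on its own, for example with a missing label or an empty list of results.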
Running the scraper
Now, we’re ready to run our code using Beam. In your terminal, run:
python scraper.py
You should see the headlines and the detected sentiment!