Define the environment
First, we’ll define our environment. For this project. we’ll need the following
libraries:
import beam
app = beam.App(
name="web-scraper",
cpu=4,
memory="4Gi",
gpu=0,
python_version="python3.8",
python_packages=["bs4", "transformers", "torch"],
)
Starting the environment
Spin up the environment by running beam start <your app>.py
You’ll see the red
beam text at the end of your shell path, which means you’ve entered the Beam
environment!
Write scraping logic
Now, we’ll write logic to scrape the headlines from The New York Times. Create a
new file - let’s call it scraper.py
.
import time
import requests
from bs4 import BeautifulSoup
from transformers import pipeline
def scrape_nyt():
res = requests.get("https://www.nytimes.com")
soup = BeautifulSoup(res.content, "html.parser")
headlines = soup.find_all("h3", class_="indicate-hover", text=True)
total_headlines = len(headlines)
negative_headlines = 0
for h in headlines:
title = h.get_text()
print(title)
sentiment = predict_sentiment(title)
print(sentiment)
if sentiment.get("NEGATIVE") > sentiment.get("POSITIVE"):
negative_headlines += 1
print(f"{negative_headlines} negative headlines / {total_headlines} total")
def predict_sentiment(title):
model = pipeline(
"sentiment-analysis", model="siebert/sentiment-roberta-large-english"
)
result = model(title, truncation=True, top_k=2)
prediction = {i["label"]: i["score"] for i in result}
return prediction
if __name__ == "__main__":
scrape_nyt()
Running the scraper
Now, we’re ready to run our code using Beam. In your terminal, run:
You should see the headlines and the detected sentiment!