Define the environment

First, we’ll define our environment. For this project, we’ll need the libraries listed under python_packages:

app.py
import beam

app = beam.App(
    name="web-scraper",
    cpu=4,
    memory="4Gi",
    gpu=0,
    python_version="python3.8",
    python_packages=["requests", "bs4", "transformers", "torch"],
)

Starting the environment

Spin up the environment by running beam start <your app>.py (here, beam start app.py). You’ll see the red beam text at the end of your shell prompt, which means you’ve entered the Beam environment!
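Once you’re inside the environment, it can help to confirm that the packages from app.py are importable before running anything heavier. The snippet below is a minimal sketch of such a check, not part of Beam. The helper name check_packages is hypothetical, and it relies only on the standard library’s importlib.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each module name to True if it can be imported."""
    # find_spec returns None when a top-level module cannot be located
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Module names imported by the scraper (assumed to match app.py's packages)
    for name, found in check_packages(["bs4", "transformers", "torch"]).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

If any line prints MISSING, the package list in app.py is the first place to look.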

Write scraping logic

Now, we’ll write logic to scrape the headlines from The New York Times. Create a new file - let’s call it scraper.py.

scraper.py
import requests
from bs4 import BeautifulSoup
from transformers import pipeline


def scrape_nyt():
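    # A timeout keeps the request from hanging; raise_for_status surfaces HTTP errors
    res = requests.get("https://www.nytimes.com", timeout=30)
    res.raise_for_status()
    soup = BeautifulSoup(res.content, "html.parser")
    # Grab all headlines (string=True replaces the deprecated text=True argument)
    headlines = soup.find_all("h3", class_="indicate-hover", string=True)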

    total_headlines = len(headlines)
    negative_headlines = 0

    # Iterate through each headline
    for h in headlines:
        title = h.get_text()
        print(title)
        sentiment = predict_sentiment(title)

        print(sentiment)

        # Default to 0.0 so a missing label cannot cause a None comparison
        if sentiment.get("NEGATIVE", 0.0) > sentiment.get("POSITIVE", 0.0):
            negative_headlines += 1

    print(f"{negative_headlines} negative headlines / {total_headlines} total")


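# Cache the pipeline so the model is loaded once, not once per headline
_sentiment_model = None


def predict_sentiment(title):
    global _sentiment_model
    if _sentiment_model is None:
        _sentiment_model = pipeline(
            "sentiment-analysis", model="siebert/sentiment-roberta-large-english"
        )
    # top_k=2 returns the scores for both labels (POSITIVE and NEGATIVE)
    result = _sentiment_model(title, truncation=True, top_k=2)
    # Map each label to its score, e.g. {"NEGATIVE": ..., "POSITIVE": ...}
    prediction = {i["label"]: i["score"] for i in result}

    return prediction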


if __name__ == "__main__":
    scrape_nyt()
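To make the counting step concrete, here is a minimal sketch that runs without downloading the model. It assumes the pipeline returns a list of label/score dicts for each headline, as the dict comprehension in predict_sentiment implies. The scores are made-up illustrative values, and to_prediction and count_negative are hypothetical helpers, not part of the original script.

```python
def to_prediction(result):
    """Reshape [{"label": ..., "score": ...}, ...] into {label: score}."""
    return {item["label"]: item["score"] for item in result}


def count_negative(predictions):
    """Count predictions whose NEGATIVE score exceeds the POSITIVE score."""
    return sum(
        1
        for p in predictions
        if p.get("NEGATIVE", 0.0) > p.get("POSITIVE", 0.0)
    )


if __name__ == "__main__":
    # Made-up scores standing in for real model output
    fake_results = [
        [{"label": "NEGATIVE", "score": 0.98}, {"label": "POSITIVE", "score": 0.02}],
        [{"label": "POSITIVE", "score": 0.91}, {"label": "NEGATIVE", "score": 0.09}],
        [{"label": "NEGATIVE", "score": 0.75}, {"label": "POSITIVE", "score": 0.25}],
    ]
    predictions = [to_prediction(r) for r in fake_results]
    print(f"{count_negative(predictions)} negative headlines / {len(predictions)} total")
```

With these sample values, two of the three headlines count as negative. Keeping the counting logic separate from the network and model calls also makes it easy to test on its own.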

Running the scraper

Now, we’re ready to run our code using Beam. In your terminal, run:

python scraper.py

You should see the headlines and the detected sentiment!