
In this post, we’re going to create a web scraper with the Scrapy library to collect the actors of one of my favorite TV shows, Seinfeld, along with every movie or TV show each of them has worked on. After scraping the data, we’ll see which movies or TV shows other than Seinfeld the most actors have in common, to find something similar I can watch next. Here’s a link to my project repository. Here’s how we set up the project:

Writing My Scraper

To begin, we create a Scrapy project with the commands

conda activate PIC16B
scrapy startproject IMDB_scraper
cd IMDB_scraper

and create the file imdb_spider.py in the spiders directory with the following:

# to run:
# scrapy crawl imdb_spider -o results.csv

import scrapy

class ImdbSpider(scrapy.Spider):
    # name of spider
    name = 'imdb_spider'
    
    # Seinfeld IMDB page
    start_urls = ['https://www.imdb.com/title/tt0098904/']

    def parse(self, response):
        cast_and_crew = response.css("[href^=\"fullcredits\"]").attrib["href"]
        cast_and_crew = response.urljoin(cast_and_crew)
        yield scrapy.Request(cast_and_crew, callback = self.parse_full_credits)
    
    def parse_full_credits(self, response):
        actors = [a.attrib["href"] for a in response.css("td.primary_photo a")]
        response = response.replace(url = "https://www.imdb.com")

        for actor in actors:
            actor = response.urljoin(actor)
            yield scrapy.Request(actor, callback = self.parse_actor_page)

    def parse_actor_page(self, response):
        actor_name = response.css("span.itemprop::text").get()
        filmography = response.css("div.filmo-row")
        movie_or_TV = filmography.css("b a::text").getall()

        for movie_or_TV_name in movie_or_TV:
            yield {
                "actor" : actor_name,
                "movie_or_TV_name" : movie_or_TV_name
            }

Each of the parsing functions is explained in detail below.

Implementation of parse() Method

We’d like to navigate from the Seinfeld IMDB page to its “Cast and Crew” page. Inspecting the page’s HTML, we notice that the link to the “Cast and Crew” page has an href starting with “fullcredits”, and is the only link to do so. We use a CSS selector to extract this link and then request the page it points to.


def parse(self, response):
    """
    parses a movie or TV show's IMDB page, and calls parse_full_credits() on the "Cast and Crew" page on the movie or TV show's IMDB page
    """

    # using CSS selection to select the link whose href starts with "fullcredits"
    cast_and_crew = response.css("[href^=\"fullcredits\"]").attrib["href"]

    # resolving the relative "fullcredits" link against the Seinfeld IMDB page's URL
    cast_and_crew = response.urljoin(cast_and_crew)

    # passing cast_and_crew page to parse_full_credits()
    yield scrapy.Request(cast_and_crew, callback = self.parse_full_credits)
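
To see what the urljoin step does, here’s a small sketch using Python’s standard urllib.parse.urljoin, which, as far as I know, response.urljoin builds on. The href value below is an assumption for illustration; the real link on the page may look different.

```python
from urllib.parse import urljoin

# The page our spider starts from (same URL as in start_urls).
base_url = "https://www.imdb.com/title/tt0098904/"

# Hypothetical href value, standing in for what the CSS selector returns.
relative_link = "fullcredits?ref_=tt_cl_sm"

# A relative link is resolved against the directory of the base URL.
cast_and_crew = urljoin(base_url, relative_link)
print(cast_and_crew)
```

Because the base URL ends in a slash, the relative link is simply appended to it, which is why the selector only needs to find the “fullcredits” part.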

Implementation of parse_full_credits() Method

On the “Cast and Crew” page, scrolling down to the “Series Cast” section reveals a list of every actor or actress who worked on Seinfeld. We’d like to visit each of their pages and scrape all the movies or TV shows they’ve worked on. In the cast table, each actor or actress has a “td” cell with class “primary_photo” that contains a link to their page. Using CSS selection, we collect these links.

def parse_full_credits(self, response):
    """
    parses the "Cast and Crew" page of a movie or TV show's IMDB page, and calls parse_actor_page on each of the actor's or actress' pages on the "Cast and Crew" page
    """

    # using CSS selection to select links inside "td" cells with class "primary_photo"
    actors = [a.attrib["href"] for a in response.css("td.primary_photo a")]

    # changing the response URL to "https://www.imdb.com", since actor and actress links are relative to the site root
    response = response.replace(url = "https://www.imdb.com")

    # passing actor pages to parse_actor_page()
    for actor in actors:
        actor = response.urljoin(actor)
        yield scrapy.Request(actor, callback = self.parse_actor_page)
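
A side note on the response.replace step: root-relative links (ones starting with “/”) are resolved against the site root anyway, at least with the standard library’s urljoin, which I believe response.urljoin relies on. The sketch below uses a made-up actor path to illustrate. If the same holds inside Scrapy, the replace call is harmless but may not be strictly needed.

```python
from urllib.parse import urljoin

# URL of the "Cast and Crew" page being parsed.
credits_url = "https://www.imdb.com/title/tt0098904/fullcredits"

# Made-up root-relative href of the kind found in the cast table.
actor_href = "/name/nm0000000/"

# Joining against the page URL and against the bare domain gives the same result.
from_page = urljoin(credits_url, actor_href)
from_root = urljoin("https://www.imdb.com", actor_href)
print(from_page)
print(from_page == from_root)
```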

Implementation of parse_actor_page() Method

On the actor’s or actress’ page, scrolling down to the “Filmography” section reveals every movie or TV show the actor or actress has worked on. First, we scrape the name: the page’s HTML reveals that it is in the first “span” tag with class “itemprop”, so we use CSS selection to select that tag’s text. To scrape the movies and TV shows, we note that each row of the “Filmography” table is a “div” tag with class “filmo-row odd” or “filmo-row even”, and that each row has a “b” tag whose link holds the movie or TV show’s title. Luckily, extraneous links to specific episodes the actor or actress appeared in sit in “a” tags outside the “b” tag, so selecting only links inside “b” tags skips them. We use CSS selection to scrape all the movies and TV shows.

def parse_actor_page(self, response):
    """
    parses an actor's or actress' IMDB page, and yields a dictionary of the actor's or actress' name and all of the movies and TV shows they've worked on
    """

    # using CSS selection to select text in first "span" tag with class "itemprop"
    actor_name = response.css("span.itemprop::text").get()

    # using CSS selection to select rows with "div" tag with class "filmo-row"
    filmography = response.css("div.filmo-row")

    # using CSS selection to select the text of links inside "b" tags in each row
    movie_or_TV = filmography.css("b a::text").getall()

    # yielding actor and movie_or_TV_name as a dictionary
    for movie_or_TV_name in movie_or_TV:
        yield {
            "actor" : actor_name,
            "movie_or_TV_name" : movie_or_TV_name
        }

Making My Recommendations

With the parsing methods finished, we run the line

scrapy crawl imdb_spider -o results.csv

to output a CSV file of actors and the movies or TV shows they’ve worked on in the IMDB_scraper directory. We then analyze results.csv to see which movies or TV shows share actors with Seinfeld, ranked by number of shared actors.

First, we import results.csv and examine it:

import pandas as pd

results = pd.read_csv("results.csv")
results.head()
   actor        movie_or_TV_name
0  Anita Wise   Seinfeld
1  Anita Wise   Corpsing
2  Anita Wise   Bob Hope Presents the Ladies of Laughter
3  Anita Wise   An Evening at the Improv
4  Tracy Kolis  Popular

We then see which movies or TV shows are most common amongst this dataset of actors and actresses:

# use groupby to count occurrences of each movie or TV name, and sort by number of occurrences
(
    results.groupby(["movie_or_TV_name"])
    .count()
    .sort_values(by = ["actor"], ascending = False)
    .rename(columns = {"actor" : "number of shared actors"})
    .head(10)
)
movie_or_TV_name                number of shared actors
Seinfeld                        1448
ER                              298
NYPD Blue                       217
L.A. Law                        208
Seinfeld: Inside Look           185
Entertainment Tonight           166
Murphy Brown                    166
Murder, She Wrote               161
CSI: Crime Scene Investigation  157
Diagnosis Murder                152

As expected, Seinfeld is the most common TV show amongst our actors and actresses. Curiously, many of the next most common shows are crime, legal, or medical dramas, genres drastically different from the sitcom nature of Seinfeld.
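
Since the goal was to find shows other than Seinfeld, one possible refinement is to drop Seinfeld itself and count distinct actors per title rather than rows, in case an actor is listed more than once for the same title (I haven’t checked whether that happens in results.csv). Here’s a minimal sketch on a tiny made-up table; the actor names and counts are invented, and only the two column names are taken from the scraper’s output.

```python
import pandas as pd

# Toy data, invented for illustration (not the scraped results).
toy = pd.DataFrame({
    "actor": ["A", "A", "A", "B", "B", "C", "C"],
    "movie_or_TV_name": [
        "Seinfeld", "ER", "ER", "Seinfeld", "ER", "Seinfeld", "Popular",
    ],
})

shared = (
    toy[toy["movie_or_TV_name"] != "Seinfeld"]  # drop the show we started from
    .groupby("movie_or_TV_name")["actor"]
    .nunique()                                  # distinct actors per title
    .sort_values(ascending = False)
    .rename("number of shared actors")
)
print(shared)
```

In the toy table, actor “A” appears twice for “ER” but is only counted once, which is the difference between nunique() and count().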
