Blog Post 2
In this post, we’re going to create a web scraper to scrape the list of actors and all the movies or TV shows they’ve worked on in one of my favorite TV shows—Seinfeld—using the Scrapy library. After scraping the data, we’ll see which movies or TV shows other than Seinfeld most of the actors collaborated on, to see if there’s a similar movie or TV show I can watch next. Here’s a link to my project repository. Here’s how we set up the project:
Writing My Scraper
To begin, we create a Scrapy project with the commands
conda activate PIC16B
scrapy startproject IMDB_scraper
cd IMDB_scraper
and create the file imdb_spider.py
in the spiders
directory with the following:
# to run
# scrapy crawl imdb_spider -o movies.csv
import scrapy
class ImdbSpider(scrapy.Spider):
# name of spider
name = 'imdb_spider'
# Seinfeld IMDB page
start_urls = ['https://www.imdb.com/title/tt0098904/']
def parse(self, response):
cast_and_crew = response.css("[href^=\"fullcredits\"]").attrib["href"]
cast_and_crew = response.urljoin(cast_and_crew)
yield scrapy.Request(cast_and_crew, callback = self.parse_full_credits)
def parse_full_credits(self, response):
actors = [a.attrib["href"] for a in response.css("td.primary_photo a")]
response = response.replace(url = "https://www.imdb.com")
for actor in actors:
actor = response.urljoin(actor)
yield scrapy.Request(actor, callback = self.parse_actor_page)
def parse_actor_page(self, response):
actor_name = response.css("span.itemprop::text").get()
filmography = response.css("div.filmo-row")
movie_or_TV = filmography.css("b a::text").getall()
for movie_or_TV_name in movie_or_TV:
yield {
"actor" : actor_name,
"movie_or_TV_name" : movie_or_TV_name
}
Each of the parsing functions is explained in detail below.
Implementation of parse() Method
We’d like to navigate to the “Cast and Crew” page of the Seinfeld IMDB page. We notice that in the page’s CSS, the link to the “Cast and Crew” page starts with “fullcredits”, and is the only link to do so. We use CSS selection to navigate to the “Cast and Crew” page.
#
def parse(self, response):
"""
parses a movie or TV show's IMDB page, and calls parse_full_credits() on the "Cast and Crew" page on the movie or TV show's IMDB page
"""
# using CSS selection to select links that start with "fullcredits"
cast_and_crew = response.css("[href^=\"fullcredits\"]").attrib["href"]
# adding "fullcredits" to the end of the Seinfeld IMDB page's URL
cast_and_crew = response.urljoin(cast_and_crew)
# passing cast_and_crew page to parse_full_credits()
yield scrapy.Request(cast_and_crew, callback = self.parse_full_credits)
Implementation of parse_full_credits() Method
On the Cast and Crew page, scrolling down to the “Series Cast” section reveals a list of every actor or actress who worked on Seinfeld. We’d like to navigate to each of these actors and actresses to then scrape all the movies or TV shows they’ve worked on. All actors and actresses have a “td” tag with class “primary_photo”. Using CSS selection, we select the links to these actors’ and actresses’ pages.
def parse_full_credits(self, response):
"""
parses the "Cast and Crew" page of a movie or TV show's IMDB page, and calls parse_actor_page on each of the actor's or actress' pages on the "Cast and Crew" page
"""
# using CSS selection to select links with "td" tag with class "primary_photo"
actors = [a.attrib["href"] for a in response.css("td.primary_photo a")]
# changing response variable to "https://imdb.com", since actor and actress pages stem from "https://imdb.com"
response = response.replace(url = "https://www.imdb.com")
# passing actor pages to parse_actor_page()
for actor in actors:
actor = response.urljoin(actor)
yield scrapy.Request(actor, callback = self.parse_actor_page)
Implementation of parse_actor_page() Method
On the actor’s or actress’ page, scrolling down to the “Filmography” section reveals every movie or TV show the actor or actress has worked in. First, we scrape the actor’s or actress’ name—the pages’s HTML reveals that the actor’s or actress’ name is in the first “span” tag with class “itemprop”, so we use CSS selection to select the first “span” tag with class “itemprop”. To scrape these movies and TV shows, we realize that the entire “Filmography” table is in a “div” tag with class “filmo-row odd” or “filmo-row even”. In this “Filmography” table, each row has a “b” tag with the movie or TV show’s title. Luckily, extraneous information about specific episodes the actor or actress was in is in an “a” tag, not a “b” tag. We use CSS selection to scrape all the movies and TV shows.
def parse_actor_page(self, response):
"""
parses an actor's or actress' IMDB page, and yields a dictionary of the actor's or actress' name and all of the movies and TV shows they've worked on
"""
# using CSS selection to select text in first "span" tag with class "itemprop"
actor_name = response.css("span.itemprop::text").get()
# using CSS selection to select rows with "div" tag with class "filmo-row"
filmography = response.css("div.filmo-row")
# using CSS selection to select "b" tag text in each row
movie_or_TV = filmography.css("b a::text").getall()
# yielding actor and movie_or_TV_name as a dictionary
for movie_or_TV_name in movie_or_TV:
yield {
"actor" : actor_name,
"movie_or_TV_name" : movie_or_TV_name
}
Making my Recommendations
With the parsing methods finished, we run the line
scrapy crawl imdb_spider -o results.csv
to output a csv file of actors and the movies or TV shows they’ve worked on in the IMDB_scraper
directory. We then analyze results.csv to see which movies or TV shows share actors with Seinfeld, ranked by number of shared movies or TV shows.
First, we import results.csv and examine it:
import pandas as pd
results = pd.read_csv("results.csv")
results.head()
actor | movie_or_TV_name | |
---|---|---|
0 | Anita Wise | Seinfeld |
1 | Anita Wise | Corpsing |
2 | Anita Wise | Bob Hope Presents the Ladies of Laughter |
3 | Anita Wise | An Evening at the Improv |
4 | Tracy Kolis | Popular |
We then see which movies or TV shows are most common amongst this dataset of actors and actresses:
# use groupby to count occurrences of each movie or TV name, and sort by number of occurrences
results.groupby(["movie_or_TV_name"]).count().sort_values(by = ["actor"], ascending = False).rename(columns = {"actor" : "number of shared actors"}).head(10)
number of shared actors | |
---|---|
movie_or_TV_name | |
Seinfeld | 1448 |
ER | 298 |
NYPD Blue | 217 |
L.A. Law | 208 |
Seinfeld: Inside Look | 185 |
Entertainment Tonight | 166 |
Murphy Brown | 166 |
Murder, She Wrote | 161 |
CSI: Crime Scene Investigation | 157 |
Diagnosis Murder | 152 |
As expected, Seinfeld is the most common TV show amongst our actors and actresses. Curiously, most of the next most common shows are crime dramas, a genre drastically different from the sitcom nature of Seinfeld.