Commit 5ef26235 authored by Corentin Vande Kerckhove

update docextraction as imdb does not allow easy crawling anymore

parent 3ab331ed
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
from typing import List, Dict from typing import List, Dict
import codecs import codecs
import numpy import numpy
import os import os
import bs4 import bs4
import httplib2 import httplib2
import requests
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
class Actor: class Actor:
""" """
This class represents an actor. This class represents an actor.
| |
The instance attributes are: The instance attributes are:
actor_id: actor_id:
Identifier of the actor. Identifier of the actor.
name: name:
Name of the actor. Name of the actor.
movies: movies:
List of movies in which the actor has played. List of movies in which the actor has played.
""" """
# ------------------------------------------------------------------------- # -------------------------------------------------------------------------
actor_id: int actor_id: int
name: str name: str
movies: List["Movie"] movies: List["Movie"]
# ------------------------------------------------------------------------- # -------------------------------------------------------------------------
def __init__(self, actor_id: int, name: str): def __init__(self, actor_id: int, name: str):
""" """
Constructor. Constructor.
:param actor_id: Identifier of the actor. :param actor_id: Identifier of the actor.
:param name: Name of the actor. :param name: Name of the actor.
""" """
self.actor_id = actor_id self.actor_id = actor_id
self.name = name self.name = name
self.movies = [] self.movies = []
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
class Movie: class Movie:
""" """
This class represents a movie_to_analyse. This class represents a movie_to_analyse.
| |
The instance attributes are: The instance attributes are:
movie_id: movie_id:
Identifier of the movie_to_analyse. Identifier of the movie_to_analyse.
name: name:
Name of the movie_to_analyse in the IMDb database. Name of the movie_to_analyse in the IMDb database.
actors: actors:
List of actors who have played in the movie_to_analyse. List of actors who have played in the movie_to_analyse.
summary: summary:
Summary of the movie_to_analyse. Summary of the movie_to_analyse.
""" """
# ------------------------------------------------------------------------- # -------------------------------------------------------------------------
movie_id: int movie_id: int
name: str name: str
actors: List[Actor] actors: List[Actor]
summary: str summary: str
# ------------------------------------------------------------------------- # -------------------------------------------------------------------------
def __init__(self, movie_id: int, name: str): def __init__(self, movie_id: int, name: str):
""" """
Constructor. Constructor.
:param movie_id: Identifier of the movie_to_analyse. :param movie_id: Identifier of the movie_to_analyse.
:param name: Name of the movie_to_analyse. :param name: Name of the movie_to_analyse.
""" """
self.movie_id = movie_id self.movie_id = movie_id
self.name = name self.name = name
self.actors = [] self.actors = []
self.summary = "" self.summary = ""
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
class Parser: class Parser:
""" """
| |
The instance attributes are: The instance attributes are:
output: output:
Directory where to store the resulting data. Directory where to store the resulting data.
basic_url: basic_url:
Beginning of the URL used to retrieve the HTML page of a movie_to_analyse. Beginning of the URL used to retrieve the HTML page of a movie_to_analyse.
actors: actors:
Dictionary of actors (the identifiers are the key). Dictionary of actors (the identifiers are the key).
actors_by_name: actors_by_name:
Dictionary of actors (the names are the key). Dictionary of actors (the names are the key).
movies: movies:
Dictionary of movies (the identifiers are the key). Dictionary of movies (the identifiers are the key).
""" """
# ------------------------------------------------------------------------- # -------------------------------------------------------------------------
output: str output: str
basic_url: str basic_url: str
actors: Dict[int, Actor] actors: Dict[int, Actor]
actors_by_name: Dict[str, Actor] actors_by_name: Dict[str, Actor]
movies: Dict[int, Movie] movies: Dict[int, Movie]
# ------------------------------------------------------------------------- # -------------------------------------------------------------------------
def __init__(self, output: str, basic_url: str) -> None: def __init__(self, output: str, basic_url: str) -> None:
""" """
Initialize the parser. Initialize the parser.
:param output: Directory where to store the results. :param output: Directory where to store the results.
:param basic_url: Beginning part of the URL of a movie_to_analyse page. :param basic_url: Beginning part of the URL of a movie_to_analyse page.
""" """
self.output = output + os.sep self.output = output + os.sep
self.basic_url = basic_url self.basic_url = basic_url
self.actors = dict() self.actors = dict()
self.actors_by_name = dict() self.actors_by_name = dict()
self.movies = dict() self.movies = dict()
# ------------------------------------------------------------------------- # -------------------------------------------------------------------------
def extract_data(self, movie: str) -> None: def extract_data(self, movie: str) -> None:
""" """
Extract the "useful" data from the page. In practice, the following steps are executed: Extract the "useful" data from the page. In practice, the following steps are executed:
1. Build the URL of the movie_to_analyse page. 1. Build the URL of the movie_to_analyse page.
2. Create a new Movie instance and add it to the list. 2. Create a new Movie instance and add it to the list.
3. Download the HTML page and use an instance of BeautifulSoup to parse. 3. Download the HTML page and use an instance of BeautifulSoup to parse.
4. Extract all "div" tags and analyze those of the class "summary_text" (summary of the movie_to_analyse) and 4. Extract all "div" tags and analyze those of the class "summary_text" (summary of the movie_to_analyse) and
"credit_summary_item" (directors, producers, actors, etc.). "credit_summary_item" (directors, producers, actors, etc.).
:param movie: Analyzed movie_to_analyse. :param movie: Analyzed movie_to_analyse.
""" """
url = self.basic_url + movie url = self.basic_url + movie
doc_id = len(self.movies) + 1 # First movie_id = 1 doc_id = len(self.movies) + 1 # First movie_id = 1
movie = Movie(doc_id, movie) movie = Movie(doc_id, movie)
self.movies[doc_id] = movie self.movies[doc_id] = movie
# Download the HTML and parse it through Beautifulsoup # Download the HTML and parse it through Beautifulsoup
h = httplib2.Http("./docs/.cache") h = httplib2.Http("./docs/.cache")
resp, content = h.request(url, "GET") resp, content = h.request(url, "GET")
soup = bs4.BeautifulSoup(content, "html.parser") soup = bs4.BeautifulSoup(content, "html.parser")
# Extract the content # Extract infos
divs = soup.find_all("div") self.extract_summary(movie, soup)
for div in divs: self.extract_actors(movie, soup)
div_class = div.get("class")
if str(div_class)[:15] == "['GenresAndPlot":
spans=div.find_all("span")
for span in spans:
span_data=span.get("data-testid")
if span_data=="plot-xs_to_m":
try:
movie.summary = span.string.strip()
except:
movie.summary = span.contents[0]
print(movie.summary)
elif div_class == ['ipc-shoveler', 'title-cast__grid']:
self.extract_actors(movie, div)
# ------------------------------------------------------------------------- # -------------------------------------------------------------------------
def extract_actors(self, movie, div) -> None: def extract_summary(self, movie, soup) -> None:
""" """
This function takes the content of a "div" tag to determine if it contains actors. In practice, the following This function extracts the summary from a movie/tv-show.
steps are executed: It uses the find_all method of BeautifulSoup to find the "overview" class.
"""
1. Look if there is a "h4" tag that contains "Stars:". divs = soup.find_all("div")
for div in divs:
2. Extract all the links that begins with "/name". These are links to actor pages and the name of div_class = div.get("class")
the actor is extracted. if div_class is not None:
if 'overview' in div_class:
movie.summary = div.text
print(movie.summary)
3. Each extracted actor is added to the global list of actors and the list of actors for the analyzed movie_to_analyse.
:param movie: Analyzed movie_to_analyse. # -------------------------------------------------------------------------
:param div: A "div" tag that could contain the actors. def extract_actors(self, movie, soup) -> None:
""" """
This function extracts the list of actors displayed for a specific movie/tv-show.
It uses the select method of BeautifulSoup to extract the actors displayed on the page.
Actors are defined in people scroller cards.
"""
soup_results = soup.select("ol[class='people scroller'] li[class='card'] p a")
actors = [soup_result.text for soup_result in soup_results]
print(actors)
# Store actors in class dictionaries
for actor in actors:
if actor not in self.actors_by_name.keys():
actor_id = len(self.actors) + 1 # First actor_id = 1
new_actor = Actor(actor_id, actor)
self.actors[actor_id] = new_actor
self.actors_by_name[actor] = new_actor
self.actors_by_name[actor].movies.append(movie)
movie.actors.append(self.actors_by_name[actor])
# Check whether this div contains the actors
divs= div.find_all("div")
if (divs is None) :
return
# Extract all the text of the links beginning with "name"
for div in divs:
div_class = div.get("data-testid")
if div_class=="title-cast-item":
for link in div.find_all("a"):
href = link["href"]
if href[:5] != "/name" or link["class"] == ["ipc-lockup-overlay", "ipc-focusable"]:
continue
actor = link.string.strip()
# Add the fact that the current actor is in the current movie_to_analyse
if actor not in self.actors_by_name.keys():
actor_id = len(self.actors) + 1 # First actor_id = 1
new_actor = Actor(actor_id, actor)
self.actors[actor] = new_actor
self.actors_by_name[actor] = new_actor
self.actors_by_name[actor].movies.append(movie)
movie.actors.append(self.actors_by_name[actor])
# ------------------------------------------------------------------------- # -------------------------------------------------------------------------
def write_files(self) -> None: def write_files(self) -> None:
""" """
Write all the files. Four things are done: Write all the files. Four things are done:
1. For each document, create a file (doc*.txt) that contains the summary and the name of 1. For each document, create a file (doc*.txt) that contains the summary and the name of
the actors. the actors.
2. Create a CSV file "actors.txt" with all the actors and their identifiers. 2. Create a CSV file "actors.txt" with all the actors and their identifiers.
3. Build a matrix actors/actors which elements represent the number of times two actors are playing in the same 3. Build a matrix actors/actors which elements represent the number of times
two actors are playing in the same
movie_to_analyse. movie_to_analyse.
4. Create a CSV file "links.txt" that contains all the pairs of actors having played together. 4. Create a CSV file "links.txt" that contains all the pairs of actors having played together.
""" """
# Write the clean text # Write the clean text
for movie in self.movies.values(): for movie in self.movies.values():
movie_file = codecs.open(self.output + 'doc' + str(movie.movie_id) + ".txt", 'w', "utf-8") movie_file = codecs.open(self.output + 'doc_' + str(movie.movie_id) + ".txt", 'w', "utf-8")
movie_file.write(movie.summary + "\n") movie_file.write(movie.summary + "\n")
for actor in movie.actors: for actor in movie.actors:
movie_file.write(actor.name + "\n") movie_file.write(actor.name + "\n")
# Write the list of actors # Write the list of actors
actors_file = codecs.open(self.output + "actors.txt", 'w', "utf-8") actors_file = codecs.open(self.output + "actors.txt", 'w', "utf-8")
for actor in self.actors.values(): for actor in self.actors.values():
actors_file.write(str(actor.actor_id) + ',"' + actor.name + '"\n') actors_file.write(str(actor.actor_id) + ',"' + actor.name + '"\n')
# Build the matrix actors/actors # Build the matrix actors/actors
matrix = numpy.zeros(shape=(len(self.actors), len(self.actors))) matrix = numpy.zeros(shape=(len(self.actors), len(self.actors)))
for movie in self.movies.values(): for movie in self.movies.values():
for i in range(0, len(movie.actors) - 1): for i in range(0, len(movie.actors) - 1):
for j in range(i + 1, len(movie.actors)): for j in range(i + 1, len(movie.actors)):
# ! Matrix begins with 0, actors with 1 # ! Matrix begins with 0, actors with 1
matrix[movie.actors[i].actor_id - 1, movie.actors[j].actor_id - 1] += 1 matrix[movie.actors[i].actor_id - 1, movie.actors[j].actor_id - 1] += 1
matrix[movie.actors[j].actor_id - 1, movie.actors[i].actor_id - 1] += 1 matrix[movie.actors[j].actor_id - 1, movie.actors[i].actor_id - 1] += 1
# Write only the positive links # Write only the positive links
links_file = codecs.open(self.output + "links.txt", 'w', "utf-8") links_file = codecs.open(self.output + "links.txt", 'w', "utf-8")
for i in range(0, len(self.actors) - 1): for i in range(0, len(self.actors) - 1):
for j in range(i + 1, len(self.actors)): for j in range(i + 1, len(self.actors)):
weight = matrix[i, j] weight = matrix[i, j]
if weight > 0.0: if weight > 0.0:
# ! Matrix begins with 0, actors with 1 # ! Matrix begins with 0, actors with 1
links_file.write(str(i + 1) + "," + str(j + 1) + "," + str(weight) + "\n") links_file.write(str(i + 1) + "," + str(j + 1) + "," + str(weight) + "\n")
``` ```
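The link-building steps described in the `write_files` docstring (count how often two actors share a movie, then keep only the positive links) can also be sketched without a dense matrix. The following is a minimal illustration, not the commit's implementation: it swaps the `numpy` matrix for a `collections.Counter`, and the function name `count_co_appearances` and the sample casts are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def count_co_appearances(casts: List[List[int]]) -> Dict[Tuple[int, int], int]:
    """Count how many movies each pair of actor identifiers shares."""
    links: Counter = Counter()
    for cast in casts:
        # Deduplicate and sort so each pair is stored once as (low, high).
        for pair in combinations(sorted(set(cast)), 2):
            links[pair] += 1
    return dict(links)


# Hypothetical casts, given as lists of actor identifiers (first identifier = 1).
casts = [[1, 2, 3], [1, 4], [1, 2]]
links = count_co_appearances(casts)

# Emit one "first,second,weight" row per pair, as in links.txt.
for (first, second), weight in sorted(links.items()):
    print(f"{first},{second},{weight}")
```

Only pairs that actually occur are stored, so no filtering on `weight > 0` is needed. Note that the weights here are integers, whereas the matrix version in the notebook writes them as floats.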
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# ---------------------------------------------------------------------------------------------------------------------- # ----------------------------------------------------------------------------------------
# Initialize a list of movies to download # Initialize a list of movies to download
movies = ["0078788", "0068646", "0083891","0113189"] movies = [
basic_url_to_analyze = "https://www.imdb.com/title/tt" ("Harry Potter 1","671"),
("Titanic","597"),
("The Wolf of Wall Street", "106646"),
]
basic_url_to_analyze = 'https://www.themoviedb.org/movie/'
dir_docs = "./docs" dir_docs = "./docs"
# ---------------------------------------------------------------------------------------------------------------------- # -----------------------------------------------------------------------------------------
# Use our custom parser to download each HTML page and save the actors and the links # Use our custom parser to download each HTML page and save the actors and the links
parser = Parser(dir_docs, basic_url_to_analyze) parser = Parser(dir_docs, basic_url_to_analyze)
for movie_to_analyse in movies: for movie_label, movie_id in movies:
parser.extract_data(movie_to_analyse) parser.extract_data(movie_id)
parser.write_files() parser.write_files()
``` ```
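When several movies are parsed in a row, an actor who appears in more than one cast must keep a single identifier. The sketch below isolates that get-or-create step under stated assumptions: `ActorRegistry` and its attributes are hypothetical names, and plain strings stand in for the notebook's `Actor` objects. The cast names are taken from the notebook output.

```python
from typing import Dict, List


class ActorRegistry:
    """Hypothetical helper that assigns one identifier per actor name."""

    def __init__(self) -> None:
        self.by_id: Dict[int, str] = {}    # identifier -> name
        self.by_name: Dict[str, int] = {}  # name -> identifier

    def get_or_create(self, name: str) -> int:
        """Return the identifier for `name`, creating it on first sight."""
        if name not in self.by_name:
            actor_id = len(self.by_id) + 1  # Identifiers start at 1.
            self.by_id[actor_id] = name
            self.by_name[name] = actor_id
        return self.by_name[name]


registry = ActorRegistry()
titanic_cast: List[str] = ["Leonardo DiCaprio", "Kate Winslet"]
wolf_cast: List[str] = ["Leonardo DiCaprio", "Jonah Hill"]

# The second "Leonardo DiCaprio" reuses the identifier created for the first.
ids = [registry.get_or_create(name) for name in titanic_cast + wolf_cast]
print(ids)
```

Keying one dictionary by identifier and the other by name keeps both lookups constant-time and matches the two views of the actors described in the `Parser` docstring.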
%% Output %% Output
A U.S. Army officer serving in Vietnam is tasked with assassinating a renegade Special Forces Colonel who sees himself as a god.
The Godfather follows Vito Corleone Don of the Corleone family as he passes the mantel to his son Michael Harry Potter has lived under the stairs at his aunt and uncle's house his whole life. But on his 11th birthday, he learns he's a powerful wizard—with a place waiting for him at the Hogwarts School of Witchcraft and Wizardry. As he learns to harness his newfound powers with the help of the school's kindly headmaster, Harry uncovers the truth about his parents' deaths—and about the villain who's to blame.
A C.I.A. Agent tries to infiltrate Soviet intelligence to stop a murderous diabolical plot.
Years after a friend and fellow 00 agent is killed on a joint mission, a secret space based weapons program known as "GoldenEye" is stolen. James Bond sets out to stop a Russian crime syndic... ['Daniel Radcliffe', 'Rupert Grint', 'Emma Watson', 'Richard Harris', 'Tom Felton', 'Alan Rickman', 'Robbie Coltrane', 'Maggie Smith', 'Richard Griffiths']
101-year-old Rose DeWitt Bukater tells the story of her life aboard the Titanic, 84 years later. A young Rose boards the ship with her mother and fiancé. Meanwhile, Jack Dawson and Fabrizio De Rossi win third-class tickets aboard the ship. Rose tells the whole story from Titanic's departure through to its death—on its first and last voyage—on April 15, 1912.
['Leonardo DiCaprio', 'Kate Winslet', 'Billy Zane', 'Gloria Stuart', 'Kathy Bates', 'Frances Fisher', 'Bill Paxton', 'Bernard Hill', 'David Warner']
A New York stockbroker refuses to cooperate in a large securities fraud case involving corruption on Wall Street, corporate banking world and mob infiltration. Based on Jordan Belfort's autobiography.
['Leonardo DiCaprio', 'Jonah Hill', 'Margot Robbie', 'Matthew McConaughey', 'Kyle Chandler', 'Rob Reiner', 'Jon Bernthal', 'Jean Dujardin', 'Kenneth Choi']
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
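The `actors.txt` file is written above by concatenating the identifier, a comma, and the quoted name by hand, which produces an ambiguous row if a name ever contains a double quote. A possible alternative, shown here only as a sketch, is to let the standard `csv` module do the quoting. The function name `write_actors` and the second sample name are hypothetical.

```python
import csv
import io
from typing import Dict, TextIO


def write_actors(actors: Dict[int, str], handle: TextIO) -> None:
    """Write one `identifier,"name"` row per actor, escaping quotes in names."""
    writer = csv.writer(handle, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for actor_id, name in sorted(actors.items()):
        writer.writerow([actor_id, name])


# An in-memory buffer stands in for the real actors.txt file.
buffer = io.StringIO()
write_actors({1: "Daniel Radcliffe", 2: 'Jane "JJ" Doe'}, buffer)
actors_csv = buffer.getvalue()
print(actors_csv)
```

With `QUOTE_NONNUMERIC`, integer identifiers stay unquoted and names are quoted, so the layout stays close to the hand-built rows while embedded quotes are doubled according to the usual CSV convention. For a real file, opening it in a `with open(..., "w", encoding="utf-8", newline="")` block would also ensure the handle is closed.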