%% Cell type:markdown id: tags:
# Text Mining (concepts) - Exercises 6, 7, 8 and 9
In the following notebook, we aim to compute the similarities between the following movies:
- ("Harry Potter 1","671"),
- ("Harry Potter 2","672"),
- ("The lord of the ring 1", "120"),
- ("The Hobbit 1", "49051"),
Using the outcome of Exercise 2, extract a summary of those 4 movies (including actors).
In the present notebook, we compute the vectorized version of the 4 documents and the cosine similarity between each pair of movies.
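As a reminder, the cosine similarity between two document vectors u and v is cos(u, v) = (u · v) / (‖u‖ ‖v‖): it is 1 when the vectors point in the same direction and 0 when they are orthogonal.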
%% Cell type:markdown id: tags:
### Packages
%% Cell type:code id: tags:
``` python
import math
from typing import Dict
from typing import List
from typing import Optional
from os import sep
from os import walk
import numpy
#@COMPLETE : add here missing packages for Text Mining
# For NLTK, do not forget to download the required resources (punkt, stopwords)
```
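%% Cell type:markdown id: tags:
A possible set of imports for the `@COMPLETE` marker above — a sketch covering the names used later in this notebook (`stopwords`, `nltk.word_tokenize`, `TaggedDocument`, `cosine_similarity`); the exact package choice (NLTK, gensim, scikit-learn) is an assumption:
%% Cell type:code id: tags:
``` python
# Sketch of the missing imports (assumed libraries: nltk, gensim, scikit-learn)
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.metrics.pairwise import cosine_similarity

# Download the NLTK resources mentioned above (needed only once)
nltk.download("punkt")
nltk.download("stopwords")
```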
%% Cell type:markdown id: tags:
### Classes
%% Cell type:code id: tags:
``` python
class Token:
    """
    Class representing a given token. It stores the string representing the token, its identifier and the
    identifiers of the documents containing it.
    |
    The instance attributes are:
    token_id:
        Identifier of the token.
    token:
        String representing the token.
    docs:
        Identifiers of the documents containing the token.
    """
    # -------------------------------------------------------------------------
    token_id: int
    token: str
    docs: List[int]
    # -------------------------------------------------------------------------
    def __init__(self, token_id: int, token: str):
        """
        Constructor.
        :param token_id: Identifier of the token.
        :param token: String representing the token.
        """
        self.token_id = token_id
        self.token = token
        self.docs = []
    # -------------------------------------------------------------------------
    def get_idf(self, nb_docs: int) -> float:
        """
        Compute the IDF factor of a token.
        :param nb_docs: Total number of documents in the corpus.
        :return: IDF factor.
        """
        if len(self.docs) == 0:
            return 0.0
        return math.log(float(nb_docs) / float(len(self.docs)))
```
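%% Cell type:markdown id: tags:
For example, with `nb_docs = 4`, a token appearing in 2 documents gets an IDF of log(4/2) ≈ 0.69, while a token appearing in all 4 documents gets log(4/4) = 0 and thus carries no discriminating power.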
%% Cell type:code id: tags:
``` python
class Doc:
    """
    This class represents an instance of a document.
    |
    The instance attributes are:
    url:
        URL of the document (if defined).
    doc_id:
        Identifier of the document.
    text:
        Text of the document to analyse.
    vector:
        Vector representing the document.
    tokens:
        List of tokens in order of appearance. The same token may appear several times.
    """
    # -------------------------------------------------------------------------
    url: Optional[str]
    doc_id: int
    text: str
    vector: Optional[numpy.ndarray]
    tokens: Optional[List[Token]]
    # -------------------------------------------------------------------------
    def __init__(self, doc_id: int, text: str, url: Optional[str] = None):
        """
        Constructor.
        :param doc_id: Identifier of the document.
        :param text: Text of the document (raw).
        :param url: URL of the document (if any).
        """
        self.url = url
        self.doc_id = doc_id
        self.text = text
        self.vector = None
        self.tokens = None
```
%% Cell type:code id: tags:
``` python
class DocCorpus:
    """
    This class represents a corpus of documents and the corresponding dictionary of the tokens they contain.
    |
    The instance attributes are:
    docs:
        List of documents.
    tokens:
        Dictionary of tokens (strings are the keys).
    ids:
        Dictionary of tokens (identifiers are the keys).
    method:
        String representing the method used for the analysis ("TF-IDF" or "Doc2Vec").
    nb_dims:
        Number of dimensions of the semantic space.
    stopwords:
        List of stopwords to eliminate from the analysis. By default, it is the classic English list.
    """
    # -------------------------------------------------------------------------
    docs: List[Doc]
    tokens: Dict[str, Token]
    ids: Dict[int, Token]
    method: str
    nb_dims: int
    n_tokens: int
    stopwords: List[str]
    # -------------------------------------------------------------------------
    def __init__(self):
        """
        Constructor.
        """
        self.docs = []
        self.tokens = dict()
        self.ids = dict()
        self.method = "Doc2Vec"
        self.nb_dims = 0
        self.n_tokens = 0
        self.stopwords = stopwords.words('english')
    # -------------------------------------------------------------------------
    def set_method(self, name) -> None:
        """
        Change the vectorization method.
        :param name: Name of the method ("TF-IDF" or "Doc2Vec").
        """
        self.method = name
    # -------------------------------------------------------------------------
    def add_doc(self, new_doc: str, url: Optional[str] = None) -> None:
        """
        Add a string representing a document to the corpus and assign an
        identifier to the document.
        :param new_doc: New document.
        :param url: URL of the document (if any).
        """
        new_id = len(self.docs)
        self.docs.append(Doc(new_id, new_doc, url))
    # -------------------------------------------------------------------------
    def add_docs(self, docs: List[str]) -> None:
        """
        Add a list of strings representing documents to the corpus. Each document receives an
        identifier.
        :param docs: List of documents.
        """
        for cur_doc in docs:
            self.add_doc(cur_doc)
    # -------------------------------------------------------------------------
    def build_vectors(self) -> None:
        """
        Build the vectors for the documents of the corpus based on the current method.
        """
        if self.method == "Doc2Vec":
            self.build_doc2vec()
        elif self.method == "TF-IDF":
            self.build_tf_idf()
        else:
            raise ValueError("'" + self.method + "': Invalid building method")
    # -------------------------------------------------------------------------
    def get_term_document_matrix(self) -> numpy.ndarray:
        """
        Build a document-token matrix with the weights as values.
        :return: Document-token matrix.
        """
        matrix = numpy.zeros(shape=(len(self.docs), self.nb_dims))
        for cur_doc in self.docs:
            for i, weight in enumerate(cur_doc.vector):
                matrix[cur_doc.doc_id, i] = weight
        return matrix
    # -------------------------------------------------------------------------
    def add_token(self, cur_doc: Doc, token_str: str) -> None:
        """
        Add a token in string format to the corpus.
        Find the identifier of the current token in the dictionary.
        If not present, create a new Token instance.
        Attach the token to the current document and, finally, link the
        document to the Token object.
        :param cur_doc: The current document from which the token is extracted.
        :param token_str: The token after the cleaning steps (stopwords, stemming, ...).
        """
        # Find the identifier of the current token in the dictionary
        if token_str not in self.tokens.keys():
            token_id = len(self.tokens)
            token = Token(token_id, token_str)
            self.tokens[token_str] = token
            self.ids[token_id] = token
            self.n_tokens = len(self.tokens)
        else:
            token = self.tokens[token_str]
        # Add the token
        cur_doc.tokens.append(token)
        # Add a reference to the document if necessary
        if cur_doc.doc_id not in token.docs:
            token.docs.append(cur_doc.doc_id)
    # -------------------------------------------------------------------------
    def extract_tokens(self) -> None:
        """
        Extract the tokens from the text of the documents. In practice, for each document, the method
        performs the following steps:
        1. The text is transformed to lowercase.
        2. The text is tokenized.
        3. Stopwords are removed.
        The method works incrementally. Once a document is treated, it will not be re-treated in successive
        calls.
        """
        # @COMPLETE : create a stemmer
        for cur_doc in self.docs:
            if cur_doc.tokens is not None:
                continue
            cur_doc.tokens = []
            text = cur_doc.text
            # @COMPLETE : get text to lowercase
            for extracted_token in nltk.word_tokenize(text):
                # @COMPLETE : retain only the stems of non-stopword, non-punctuation tokens
                self.add_token(cur_doc, token_str)
    # -------------------------------------------------------------------------
    def build_tf_idf(self) -> None:
        """
        Build the vectors of the corpus using the TF-IDF approach.
        """
        vectors = []
        self.extract_tokens()
        # Step 1: For each document, compute the relative frequencies of each token (TF).
        for cur_doc in self.docs:
            vector = dict()  # Dictionary representing a vector of pairs (token_id, nb_occurrences)
            nb_occurrences = 0
            # @COMPLETE : calculate a TF vector for each document and append it to vectors
        # Step 2: Build the TF-IDF vectors by multiplying the relative frequencies by the IDF factor.
        self.nb_dims = self.n_tokens
        for cur_doc in self.docs:
            cur_doc.vector = numpy.zeros(shape=self.nb_dims)
            # @COMPLETE : calculate a TF-IDF vector and store it in cur_doc.vector
            # Hint : make use of the get_idf() method of the Token class
    # -------------------------------------------------------------------------
    def build_doc2vec(self) -> None:
        """
        Build the vectors using the doc2vec approach.
        """
        self.extract_tokens()
        corpus = []
        for doc in self.docs:
            tokens = []
            for token in doc.tokens:
                tokens.append(token.token)
            corpus.append(tokens)
        corpus = [
            TaggedDocument(words, ['d{}'.format(idx)])
            for idx, words in enumerate(corpus)
        ]
        # @COMPLETE : create a doc2vec model with 5 dimensions and min_count=1
        # Add the resulting vector (model.dv) to self.docs[i].vector
```
%% Cell type:code id: tags:
``` python
class TokenSorter:
    """
    Class to sort a list of tokens by a certain value.
    |
    The instance attributes are:
    tokens:
        List of tokens to sort.
    reverse:
        Must the tokens be ranked in descending (True) or ascending (False) order?
    """
    # -------------------------------------------------------------------------
    class TokenRef:
        """
        Class to represent a reference to a token.
        """
        # ---------------------------------------------------------------------
        token: Token
        value: float
        # ---------------------------------------------------------------------
        def __init__(self, token: Token, value: float):
            self.token = token
            self.value = value
    # -------------------------------------------------------------------------
    tokens: List[TokenRef]
    reverse: bool
    # -------------------------------------------------------------------------
    def __init__(self):
        """
        Constructor.
        """
        self.tokens = []
        self.reverse = False
    # -------------------------------------------------------------------------
    def build(self, tokens, value, reverse: bool) -> None:
        """
        Build the list of tokens to sort.
        :param tokens: Tokens to sort.
        :param value: Lambda function that will be used to build the value associated with each token to sort.
        :param reverse: Should the tokens be sorted in descending (True) or ascending (False) order?
        """
        for token in tokens.values():
            self.add_token(token, value(token))
        self.reverse = reverse
        self.sort()
    # -------------------------------------------------------------------------
    def add_token(self, token: Token, value: float) -> None:
        """
        Add a token to the list.
        :param token: Token to add.
        :param value: Value that will be used to sort the tokens.
        """
        self.tokens.append(TokenSorter.TokenRef(token=token, value=float(value)))
    # -------------------------------------------------------------------------
    def sort(self) -> None:
        """
        Sort the tokens.
        """
        self.tokens.sort(reverse=self.reverse, key=lambda token: token.value)
    # -------------------------------------------------------------------------
    def get_token(self, pos: int) -> str:
        """
        Get a given token of the list.
        :param pos: Position of the token in the list.
        :return: String representing the token.
        """
        return self.tokens[pos].token.token
    # -------------------------------------------------------------------------
    def get_value(self, pos: int) -> float:
        """
        Get the value of a given token in the list.
        :param pos: Position of the token in the list.
        :return: Value of the token used for the sorting.
        """
        return self.tokens[pos].value
    # -------------------------------------------------------------------------
    def print(self, title: str, nb: int) -> None:
        """
        Print a given number of top-ranked tokens with a title and their values.
        :param title: Title to print.
        :param nb: Number of tokens to print.
        """
        print(title)
        if nb > len(self.tokens):
            nb = len(self.tokens)
        for i in range(0, nb):
            print(f"    Token: {self.get_token(i)} ({self.get_value(i):.2f})")
```
%% Cell type:markdown id: tags:
### Functions
%% Cell type:code id: tags:
``` python
def print_matrix(name: str, matrix: numpy.ndarray) -> None:
    """
    Simple method to print a little matrix nicely.
    :param name: Name of the matrix.
    :param matrix: Matrix to print.
    """
    nb_lines = matrix.shape[0]
    nb_cols = matrix.shape[1]
    spaces = " " * (len(name) + 1)
    title_line = nb_lines // 2  # print the name next to the middle line
    for i in range(0, nb_lines):
        if i == title_line:
            print(name + "=", end="")
        else:
            print(spaces, end="")
        print("( ", end="")
        for j in range(0, nb_cols):
            print("{:.3f}".format(matrix[i, j]), end=" ")
        print(")")

def create_corpus(path: str) -> DocCorpus:
    """
    From a list of docs located at path, create a corpus.
    A DocCorpus object is built and populated with all the "doc" documents
    located at the path.
    :param path: String description of the path.
    :return: DocCorpus representing the corpus of all the documents.
    """
    # Instantiate a DocCorpus object
    the_corpus = DocCorpus()
    # Look for all the files in the directory
    files = []
    dir_to_analyse = path
    for (_, _, file_names) in walk(dir_to_analyse):
        files.extend(file_names)
        break
    # Add the content to the corpus
    for doc_to_analyse in files:
        # Treat only files beginning with "doc"
        if doc_to_analyse[:3] != "doc":
            continue
        filename = dir_to_analyse + sep + doc_to_analyse
        with open(file=filename, mode="r", encoding="utf-8") as file:
            the_corpus.add_doc(file.read(), filename)
    return the_corpus
```
%% Cell type:markdown id: tags:
# Exercise 6 : token extraction and dimensionality reduction
Complete the implementation of the "extract_tokens" method from the DocCorpus class.
The method should:
- extract word tokens
- remove the case (lowercase only)
- remove punctuation
- remove stopwords
- perform stemming
%% Cell type:code id: tags:
``` python
the_corpus = create_corpus("../data-collection/docs")  # Create a corpus instance
the_corpus.extract_tokens()  # Extract the tokens from the corpus
# -------------------------------------------------------------------------------------------------
# Sort the tokens by the number of documents in which they appear
sort_by_docs = TokenSorter()
sort_by_docs.build(tokens=the_corpus.tokens, value=lambda token: len(token.docs), reverse=True)
sort_by_docs.print(title="Most appearing tokens (Nb Documents):", nb=5)
# Sort the tokens by their IDF factor
sort_by_iDF = TokenSorter()
sort_by_iDF.build(tokens=the_corpus.tokens, value=lambda token: token.get_idf(len(the_corpus.docs)), reverse=True)
sort_by_iDF.print(title="Most discriminant tokens (idf):", nb=5)
```
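%% Cell type:markdown id: tags:
A possible completion of the "extract_tokens" body — a sketch assuming the NLTK `PorterStemmer` and the `string.punctuation` constant (any stemmer would do), written as a monkey-patch of the class so the cell above can be re-run:
%% Cell type:code id: tags:
``` python
# Sketch: lowercase, tokenize, drop stopwords/punctuation, stem (assumes PorterStemmer)
def extract_tokens(self) -> None:
    stemmer = PorterStemmer()
    for cur_doc in self.docs:
        if cur_doc.tokens is not None:
            continue
        cur_doc.tokens = []
        text = cur_doc.text.lower()
        for extracted_token in nltk.word_tokenize(text):
            # Skip stopwords and tokens made only of punctuation
            if extracted_token in self.stopwords:
                continue
            if all(ch in string.punctuation for ch in extracted_token):
                continue
            token_str = stemmer.stem(extracted_token)
            self.add_token(cur_doc, token_str)

DocCorpus.extract_tokens = extract_tokens  # patch the class for this sketch
```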
%% Cell type:markdown id: tags:
# Exercise 7 : vectorization with TF-IDF
Complete the implementation of the TF-IDF method from the DocCorpus class.
%% Cell type:code id: tags:
``` python
the_corpus.build_tf_idf()
term_document_matrix = the_corpus.get_term_document_matrix()
print_matrix("Docs", term_document_matrix)
```
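%% Cell type:markdown id: tags:
One way to fill in the two `@COMPLETE` steps of "build_tf_idf" — a sketch following the TF and TF-IDF definitions given in the comments (relative frequency times IDF), again as a monkey-patch:
%% Cell type:code id: tags:
``` python
# Sketch: TF = relative frequency per document, TF-IDF = TF * IDF
def build_tf_idf(self) -> None:
    vectors = []
    self.extract_tokens()
    # Step 1: For each document, compute the relative frequencies of each token (TF).
    for cur_doc in self.docs:
        vector = dict()  # pairs (token_id, nb_occurrences)
        nb_occurrences = 0
        for token in cur_doc.tokens:
            vector[token.token_id] = vector.get(token.token_id, 0) + 1
            nb_occurrences += 1
        for token_id in vector:
            vector[token_id] /= float(nb_occurrences)  # relative frequency
        vectors.append(vector)
    # Step 2: Multiply the relative frequencies by the IDF factor.
    self.nb_dims = self.n_tokens
    nb_docs = len(self.docs)
    for cur_doc, vector in zip(self.docs, vectors):
        cur_doc.vector = numpy.zeros(shape=self.nb_dims)
        for token_id, tf in vector.items():
            cur_doc.vector[token_id] = tf * self.ids[token_id].get_idf(nb_docs)

DocCorpus.build_tf_idf = build_tf_idf  # patch the class for this sketch
```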
%% Cell type:markdown id: tags:
# Exercise 8 : vectorization with Doc2Vec
Complete the implementation of the Doc2Vec method from the DocCorpus class.
%% Cell type:code id: tags:
``` python
the_corpus.build_doc2vec()
term_document_matrix = the_corpus.get_term_document_matrix()
print_matrix("Docs", term_document_matrix)
```
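%% Cell type:markdown id: tags:
A possible completion of the last `@COMPLETE` of "build_doc2vec" — a sketch assuming gensim's `Doc2Vec` with `vector_size=5` and `min_count=1`, as the comment suggests:
%% Cell type:code id: tags:
``` python
# Sketch: train a 5-dimensional Doc2Vec model and store each document vector
def build_doc2vec(self) -> None:
    self.extract_tokens()
    corpus = [[token.token for token in doc.tokens] for doc in self.docs]
    corpus = [
        TaggedDocument(words, ['d{}'.format(idx)])
        for idx, words in enumerate(corpus)
    ]
    # min_count=1 keeps every token, even those seen only once
    model = Doc2Vec(corpus, vector_size=5, min_count=1)
    self.nb_dims = model.vector_size
    for i, doc in enumerate(self.docs):
        doc.vector = model.dv['d{}'.format(i)]

DocCorpus.build_doc2vec = build_doc2vec  # patch the class for this sketch
```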
%% Cell type:markdown id: tags:
# Exercise 9 : corpus analysis using Cosine Similarity
Display the similarity between every pair of documents.
Which movies are close to each other? Which method works best? Why?
%% Cell type:code id: tags:
``` python
def corpus_analysis(corpus: DocCorpus, method: str) -> None:
    """
    Calculate and display the cosine similarity between every pair of documents for a given vectorization method.
    :param corpus: Corpus to analyse.
    :param method: Method to use for the analysis.
    """
    print("\n---- " + method + " ----")
    corpus.set_method(method)
    corpus.build_vectors()
    matrix = corpus.get_term_document_matrix()
    # @COMPLETE : compute the cosine similarity between every pair of vectors of the matrix
    #
    for i in range(0, len(corpus.docs) - 1):
        # Take a vector and build the two-dimensional matrix needed by cosine_similarity
        vec1 = matrix[i].reshape(1, -1)
        for j in range(i + 1, len(corpus.docs)):
            # Take a vector and build the two-dimensional matrix needed by cosine_similarity
            vec2 = matrix[j].reshape(1, -1)
            # Retrieve the names of the docs
            url_i = corpus.docs[i].url
            url_j = corpus.docs[j].url
            # Compute and display the similarity
            print("\tSim(doc" + url_i + ",doc" + url_j + ")=" + "{:.3f}".format(cosine_similarity(vec1, vec2)[0, 0]))
# -----------------------------------------------------------------------------------------------------------
corpus_analysis(corpus=the_corpus, method="TF-IDF")
corpus_analysis(corpus=the_corpus, method="Doc2Vec")
```
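%% Cell type:markdown id: tags:
For reference, scikit-learn's `cosine_similarity` can be cross-checked with a direct numpy implementation of cos(u, v) = (u · v) / (‖u‖ ‖v‖) — a minimal sketch:
%% Cell type:code id: tags:
``` python
# Manual cosine similarity, to cross-check the sklearn result
def cosine(u: numpy.ndarray, v: numpy.ndarray) -> float:
    norm = numpy.linalg.norm(u) * numpy.linalg.norm(v)
    if norm == 0.0:
        return 0.0  # convention: similarity with a zero vector is 0
    return float(numpy.dot(u, v) / norm)

matrix = the_corpus.get_term_document_matrix()
print("Sim(doc0,doc1) = {:.3f}".format(cosine(matrix[0], matrix[1])))
```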
%% Cell type:code id: tags:
``` python
```