"the_corpus = create_corpus(\"./docs\")# Create a corpus instance\n",
"the_corpus = create_corpus(\"../data-collection/docs\")# Create a corpus instance\n",
"the_corpus.extract_tokens() # Extract the tokens from the corpus\n",
"the_corpus.extract_tokens() # Extract the tokens from the corpus\n",
"\n",
"\n",
"\n",
"\n",
...
%% Cell type:markdown id: tags:
# Text Mining (concepts) - Exercises 6, 7, 8 and 9
In this notebook, we aim to compute the similarities between the following movies (title, identifier):
- ("Harry Potter 1", "671"),
- ("Harry Potter 2", "672"),
- ("The Lord of the Rings 1", "120"),
- ("The Hobbit 1", "49051"),
Using the outcome of Exercise 2, extract a summary of those 4 movies (including actors).
We will then build a vectorized representation of the 4 documents and compute the cosine similarity between each pair of movies.
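
As a rough illustration of the target computation, the sketch below builds raw term-count vectors and compares every pair of documents with cosine similarity. It is not the exercise solution: the whitespace tokenizer, the helper names `term_frequencies` and `cosine_similarity`, and the toy texts are all assumptions made for this example, and the real notebook is expected to use proper tokenization and the movie summaries from Exercise 2.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict


def term_frequencies(text: str) -> Dict[str, int]:
    """Count lowercase, whitespace-separated tokens (deliberately naive)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy texts, not the real movie summaries.
toy_docs = {
    "doc_a": "young wizard school magic",
    "doc_b": "young wizard school secret chamber",
    "doc_c": "ring quest wizard hobbit",
}

vectors = {name: term_frequencies(text) for name, text in toy_docs.items()}

# One score per unordered pair of documents.
pairwise = {
    (x, y): cosine_similarity(vectors[x], vectors[y])
    for x, y in combinations(sorted(vectors), 2)
}

for pair, score in pairwise.items():
    print(pair, round(score, 3))
```

Raw counts are used here only to keep the example short; a weighting scheme such as TF-IDF would plug into the same `cosine_similarity` function, since it only needs a mapping from terms to weights.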
%% Cell type:markdown id: tags:
### Packages
%% Cell type:code id: tags:
``` python
import math
from typing import Dict
from typing import List
from typing import Optional
from os import sep
from os import walk
import numpy
#@COMPLETE : add here missing packages for Text Mining
# For NLTK, do not forget to download the required resources (punkt, stopwords)
```
%% Cell type:markdown id: tags:
### Classes
%% Cell type:code id: tags:
``` python
class Token:
    """
    Class representing a given token. It stores the string representing the token, its identifier and the number of