"the_corpus = create_corpus(\"./docs\")# Create a corpus instance\n",
"the_corpus = create_corpus(\"../data-collection/docs\")# Create a corpus instance\n",
"the_corpus.extract_tokens() # Extract the tokens from the corpus\n",
"\n",
"\n",
...
...
%% Cell type:markdown id: tags:
# Text Mining (concepts) - Exercises 6, 7, 8 and 9
In this notebook, we aim to compute the similarities between the following movies:
- ("Harry Potter 1","671"),
- ("Harry Potter 2","672"),
- ("The lord of the ring 1", "120"),
- ("The Hobbit 1", "49051"),
Using the outcome of Exercise 2, extract a summary of these 4 movies (including actors).
In this notebook, we will build a vector representation of each of the 4 documents and compute the cosine similarity between each pair of movies.
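
As a rough illustration of the end goal, here is a minimal sketch of cosine similarity on raw term counts. The helper names (`to_vector`, `cosine_similarity`) and the toy token lists are assumptions made for this example only; they are not the classes completed later in the notebook, and the exercises may use a different weighting (for instance TF-IDF) instead of raw counts.

```python
import math
from typing import Dict, List


def to_vector(tokens: List[str], vocabulary: List[str]) -> List[float]:
    """Count each vocabulary term in `tokens` (hypothetical helper)."""
    counts: Dict[str, int] = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return [float(counts.get(term, 0)) for term in vocabulary]


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Return cos(u, v) = (u . v) / (|u| * |v|), or 0.0 if a vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy token lists standing in for the real movie summaries (made-up data).
doc_a = "wizard school magic wizard".split()
doc_b = "wizard school magic".split()
doc_c = "ring hobbit journey".split()

# Shared vocabulary so that every vector has the same dimensions.
vocabulary = sorted(set(doc_a) | set(doc_b) | set(doc_c))

vec_a = to_vector(doc_a, vocabulary)
vec_b = to_vector(doc_b, vocabulary)
vec_c = to_vector(doc_c, vocabulary)

print(cosine_similarity(vec_a, vec_b))  # close to 1: many shared terms
print(cosine_similarity(vec_a, vec_c))  # 0.0: no shared terms
```

Documents that share many terms get a similarity close to 1, while documents with no terms in common get 0.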
%% Cell type:markdown id: tags:
### Packages
%% Cell type:code id: tags:
``` python
import math
from typing import Dict
from typing import List
from typing import Optional
from os import sep
from os import walk
import numpy
#@COMPLETE : add here missing packages for Text Mining
# For NLTK, do not forget to download the required resources (punkt, stopwords)
```
%% Cell type:markdown id: tags:
### Classes
%% Cell type:code id: tags:
``` python
class Token:
    """
    Class representing a given token. It stores the string representing the token, its identifier and the number of