%% Cell type:markdown id:a665885b tags:
# Evaluator Module
The Evaluator module creates evaluation reports.
Reports contain evaluation metrics for the models specified in the evaluation config.
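The exact configuration lives in `configs.py` and is not shown in this notebook. As a rough, hypothetical sketch, the attributes this notebook relies on might look like the following (the model classes and values are illustrative placeholders, not the project's actual choices):
``` python
# Hypothetical sketch of configs.py -- attribute names match how this notebook
# uses EvalConfig, but the actual models and values in the project may differ.
from surprise import NormalPredictor, KNNBasic

class EvalConfig:
    # (model_name, model_class, arguments) triples evaluated by create_evaluation_report
    models = [
        ("baseline_random", NormalPredictor, {}),
        ("baseline_knn", KNNBasic, {"k": 40}),
    ]
    split_metrics = ["mae", "rmse"]   # metrics computed on a random train/test split
    loo_metrics = ["hit_rate"]        # metrics computed on a leave-one-out split
    full_metrics = ["novelty"]        # metrics computed on the full training set
    test_size = 0.25                  # proportion of ratings held out for split metrics
    top_n_value = 10                  # number of recommendations kept per user
```
Each entry of `models` is instantiated with its arguments and evaluated on whichever metric families are non-empty.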
%% Cell type:code id:6aaf9140 tags:
``` python
# reload modules automatically before executing code
%load_ext autoreload
%autoreload 2

# third-party imports
import numpy as np
import pandas as pd

# local imports
from configs import EvalConfig
from constants import Constant as C
from loaders import export_evaluation_report
from loaders import load_ratings

# new imports
from surprise.model_selection import train_test_split
from surprise import accuracy
from surprise.model_selection import LeaveOneOut
from collections import Counter
```
%% Cell type:markdown id:d47c24a4 tags:
# 1. Model validation functions
Validation functions perform cross-validation on recommender system models.
%% Cell type:code id:d6d82188 tags:
``` python
# -- implement the function generate_split_predictions --
def generate_split_predictions(algo, ratings_dataset, eval_config):
    """Generate predictions on a random test set specified in eval_config"""
    # Split the data into train and test sets
    trainset, testset = train_test_split(ratings_dataset, test_size=eval_config.test_size)
    # Train the algorithm on the train set
    algo.fit(trainset)
    # Predict ratings for the test set
    predictions = algo.test(testset)
    return predictions


# -- implement the function generate_loo_top_n --
def generate_loo_top_n(algo, ratings_dataset, eval_config):
    """Generate top-n recommendations for each user on a random Leave-One-Out split (LOO)"""
    # Create a Leave-One-Out split
    loo = LeaveOneOut(n_splits=1)
    for trainset, testset in loo.split(ratings_dataset):
        algo.fit(trainset)  # Train the algorithm on the training set
        anti_testset = trainset.build_anti_testset()  # Build the anti test-set
        predictions = algo.test(anti_testset)  # Get predictions on the anti test-set
        top_n = {}
        for uid, iid, _, est, _ in predictions:
            if uid not in top_n:
                top_n[uid] = []
            top_n[uid].append((iid, est))
        for uid, user_ratings in top_n.items():
            user_ratings.sort(key=lambda x: x[1], reverse=True)
            top_n[uid] = user_ratings[:eval_config.top_n_value]  # Keep the top-N recommendations
        anti_testset_top_n = top_n
        return anti_testset_top_n, testset


def generate_full_top_n(algo, ratings_dataset, eval_config):
    """Generate top-n recommendations for each user using the full training set"""
    full_trainset = ratings_dataset.build_full_trainset()  # Build the full training set
    algo.fit(full_trainset)  # Train the algorithm on the full training set
    anti_testset = full_trainset.build_anti_testset()  # Build the anti test-set
    predictions = algo.test(anti_testset)  # Get predictions on the anti test-set
    top_n = {}
    for uid, iid, _, est, _ in predictions:
        if uid not in top_n:
            top_n[uid] = []
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:eval_config.top_n_value]  # Keep the top-N recommendations
    anti_testset_top_n = top_n
    return anti_testset_top_n


def precomputed_information(movie_data):
    """Return a dictionary that precomputes relevant information for evaluating in full mode

    Dictionary keys:
    - precomputed_dict["item_to_rank"]: a dictionary mapping movie ids to popularity rankings
    - (-- for your project, add other relevant information here --)
    """
    # Initialize an empty dictionary to store the item_id to rank mapping
    item_to_rank = {}
    # Count the number of ratings per movie, most rated first
    ratings_count = movie_data.groupby('movieId').size().sort_values(ascending=False)
    # Assign ranks to movies based on their popularity (rank 1 = most rated)
    for rank, (movie_id, _) in enumerate(ratings_count.items(), start=1):
        item_to_rank[movie_id] = rank
    # Create the precomputed dictionary
    precomputed_dict = {}
    precomputed_dict["item_to_rank"] = item_to_rank
    return precomputed_dict


def create_evaluation_report(eval_config, sp_ratings, precomputed_dict, available_metrics):
    """Create a DataFrame evaluating various models on the metrics specified in an evaluation config."""
    evaluation_dict = {}
    for model_name, model, arguments in eval_config.models:
        print(f'Handling model {model_name}')
        algo = model(**arguments)
        evaluation_dict[model_name] = {}

        # Type 1: split evaluations
        if len(eval_config.split_metrics) > 0:
            print('Training split predictions')
            predictions = generate_split_predictions(algo, sp_ratings, eval_config)
            for metric in eval_config.split_metrics:
                print(f'- computing metric {metric}')
                assert metric in available_metrics['split']
                evaluation_function, parameters = available_metrics["split"][metric]
                evaluation_dict[model_name][metric] = evaluation_function(predictions, **parameters)

        # Type 2: loo evaluations
        if len(eval_config.loo_metrics) > 0:
            print('Training loo predictions')
            anti_testset_top_n, testset = generate_loo_top_n(algo, sp_ratings, eval_config)
            for metric in eval_config.loo_metrics:
                assert metric in available_metrics['loo']
                evaluation_function, parameters = available_metrics["loo"][metric]
                evaluation_dict[model_name][metric] = evaluation_function(anti_testset_top_n, testset, **parameters)

        # Type 3: full evaluations
        if len(eval_config.full_metrics) > 0:
            print('Training full predictions')
            anti_testset_top_n = generate_full_top_n(algo, sp_ratings, eval_config)
            for metric in eval_config.full_metrics:
                assert metric in available_metrics['full']
                evaluation_function, parameters = available_metrics["full"][metric]
                evaluation_dict[model_name][metric] = evaluation_function(
                    anti_testset_top_n,
                    **precomputed_dict,
                    **parameters
                )

    return pd.DataFrame.from_dict(evaluation_dict).T
```
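As a quick sanity check of `precomputed_information`, a hand-made ratings DataFrame (the ids and ratings below are made up) should rank the most-rated movie first:
``` python
# Toy check (illustrative data): movie 50 has 3 ratings, movie 10 has 2, movie 99 has 1,
# so their popularity ranks should be 1, 2 and 3 respectively.
toy_ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 3],
    "movieId": [50, 50, 50, 10, 10, 99],
    "rating":  [4.0, 5.0, 3.5, 4.0, 2.0, 5.0],
})
print(precomputed_information(toy_ratings)["item_to_rank"])
# expected: {50: 1, 10: 2, 99: 3}
```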
%% Cell type:markdown id:f7e83d1d tags:
# 2. Evaluation metrics
Implement evaluation metrics either for rating predictions (split metrics) or for top-n recommendations (loo and full metrics).
%% Cell type:code id:f1849e55 tags:
``` python
# -- implement the function get_hit_rate --
def get_hit_rate(anti_testset_top_n, testset):
    """Compute the average hit rate over the users (loo metric)

    A hit (1) happens when the left-out movie in the testset was picked by the top-n recommender.
    A fail (0) happens when the left-out movie in the testset was not picked by the top-n recommender.
    """
    hits = 0
    total_users = len(testset)
    for uid, true_iid, _ in testset:
        if uid in anti_testset_top_n and true_iid in {iid for iid, _ in anti_testset_top_n[uid]}:
            hits += 1
    hit_rate = hits / total_users
    return hit_rate


# -- implement the function get_novelty --
def get_novelty(anti_testset_top_n, item_to_rank):
    """Compute the average novelty of the top-n recommendations over the users (full metric)

    Novelty is defined as the average popularity rank of the recommended movies.
    """
    total_rank_sum = 0
    total_recommendations = 0
    for uid, recommendations in anti_testset_top_n.items():
        for iid, _ in recommendations:
            if iid in item_to_rank:
                total_rank_sum += item_to_rank[iid]
                total_recommendations += 1
    if total_recommendations == 0:
        return 0  # Avoid division by zero
    average_rank = total_rank_sum / total_recommendations
    return average_rank
```
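Both metrics can be illustrated on made-up data: below, user 1's left-out movie appears in their recommendations while user 2's does not, and the novelty is simply the mean popularity rank of all recommended movies (the ids and ranks are arbitrary):
``` python
# Toy illustration with made-up ids: two users, two recommendations each.
toy_top_n = {
    1: [(10, 4.8), (20, 4.5)],   # user 1 is recommended movies 10 and 20
    2: [(30, 4.9), (40, 4.2)],   # user 2 is recommended movies 30 and 40
}
toy_testset = [(1, 10, 4.0), (2, 99, 3.0)]        # left-out ratings: user 1 -> movie 10 (hit), user 2 -> movie 99 (miss)
toy_item_to_rank = {10: 1, 20: 5, 30: 2, 40: 12}  # popularity ranks of the recommended movies

print(get_hit_rate(toy_top_n, toy_testset))       # 0.5  (1 hit out of 2 users)
print(get_novelty(toy_top_n, toy_item_to_rank))   # 5.0  ((1 + 5 + 2 + 12) / 4)
```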
%% Cell type:markdown id:1a9855b3 tags:
# 3. Evaluation workflow
Load the data, evaluate the models, and save the experimental outcomes.
%% Cell type:code id:704f4d2a tags:
``` python
AVAILABLE_METRICS = {
    "split": {
        "mae": (accuracy.mae, {'verbose': False}),
        "rmse": (accuracy.rmse, {'verbose': False})
    },
    "loo": {
        "hit_rate": (get_hit_rate, {}),
    },
    "full": {
        "novelty": (get_novelty, {}),
    }
}

sp_ratings = load_ratings(surprise_format=True)
precomputed_dict = precomputed_information(pd.read_csv("data/tiny/evidence/ratings.csv"))
evaluation_report = create_evaluation_report(EvalConfig, sp_ratings, precomputed_dict, AVAILABLE_METRICS)
export_evaluation_report(evaluation_report)
```
%% Output
Handling model baseline_1
Training split predictions
- computing metric mae
- computing metric rmse
Training loo predictions
Training full predictions
Handling model baseline_2
Training split predictions
- computing metric mae
- computing metric rmse
Training loo predictions
Training full predictions
Handling model baseline_3
Training split predictions
- computing metric mae
- computing metric rmse
Training loo predictions
Training full predictions
Handling model baseline_4
Training split predictions
- computing metric mae
- computing metric rmse
Training loo predictions
Training full predictions
The data has been exported to the evaluation report
                 mae      rmse  hit_rate     novelty
baseline_1  1.563822  1.787365  0.046729   99.405607
baseline_2  1.535869  1.866364  0.018692  429.942991
baseline_3  0.871233  1.081468  0.037383   99.405607
baseline_4  0.729477  0.926489  0.158879   60.583178
baseline_1  1.517749  1.745787  0.056075   99.405607
baseline_2  1.472806  1.805674  0.000000  429.942991
baseline_3  0.868666  1.076227  0.093458   99.405607
baseline_4  0.713063  0.912046  0.074766   60.349533
%% Cell type:markdown id:9fbf23fd tags:
Analyzing the reported results, several observations can be made across the different baselines.
Firstly, looking at the Mean Absolute Error (MAE), baseline_4 stands out with the lowest value of 0.713063, indicating the most accurate rating predictions. It is followed by baseline_3 with an MAE of 0.868666.
Next, the Root Mean Square Error (RMSE) shows the same ordering: baseline_4 again performs best with a value of 0.912046, and baseline_3 remains strong with an RMSE of 1.076227.
Examining the hit rate, baseline_3 leads with 9.35%, the highest success rate in the leave-one-out recommendations, while baseline_4 and baseline_1 follow at 7.48% and 5.61% respectively.
Lastly, looking at the novelty metric, baseline_4 scores the lowest at 60.35, indicating that its recommendations concentrate on popular, conventional movies. On the other hand, baseline_2 scores by far the highest in novelty at 429.94, implying that its recommendations are much less conventional.
In summary, baseline_4 excels on the prediction-accuracy metrics (MAE and RMSE) while recommending relatively conventional movies, baseline_3 stands out with the highest hit rate, and baseline_2, despite not excelling on the other metrics, produces by far the most novel recommendations.