Commit 861b9f4e in MLSMM2153_Web-Mining
Authored 2 years ago by Corentin Vande Kerckhove
finish to clean first exercise
Parent: a97cdcda
1 file changed: a-data-collection/exercise1.ipynb (129 additions, 609 deletions)
@@ -5,688 +5,208 @@
"metadata": {},
"source": [
"# Exercise 1 - Parsing HTML\n",
"The following notebook is greatly inspired by https://github.com/khuyentran1401/Web-Scrapping-Wikipedia"
"The following notebook is inspired by https://github.com/khuyentran1401/Web-Scrapping-Wikipedia"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import urllib.request\n",
"import time\n",
"from bs4 import BeautifulSoup\n",
"import numpy as np\n",
"import pandas as pd\n",
"from urllib.request import urlopen\n",
"\n",
"DATA_REPO = "
]
},
{
"cell_type": "code",
"execution_count": 2,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"url = 'https://en.wikipedia.org/wiki/Epidemiology_of_depression'"
"### Packages , Paths and Functions"
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"html = urlopen(url)"
"import time\n",
"import re\n",
"from pathlib import Path\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"from bs4 import BeautifulSoup\n",
"import seaborn as sns\n",
"\n",
"DATA_PATH = Path('../data/')\n",
"\n",
"# Epidemiology webpage : https://en.wikipedia.org/wiki/Epidemiology_of_depression\n",
"DEPRESSION_FILENAME = 'a1_epidemiology_of_depression.html' # Stored locally\n",
"\n",
"# Sunshine webpage : https://en.wikipedia.org/wiki/List_of_cities_by_sunshine_duration\n",
"SUNSHINE_FILENAME = 'a1_city_sunshine_duration.html' # Stored locally\n",
"\n",
"def read_html_file(path: Path, filename: str) -> str:\n",
"    \"\"\"Read an HTML stored locally\"\"\"\n",
"    with open(path / filename, \"r\") as file:\n",
"        return file.read()\n",
"\n",
"def process_num(string_number : str) -> float:\n",
"    \"\"\"Convert a string number formatted with a comma to separate thousands\n",
"    \n",
"    Example : 1,823.0 -> 1823.0\"\"\"\n",
"    return float(re.sub(r'[^\\w\\s.]','', string_number))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"soup = BeautifulSoup(html, 'html.parser')"
"## 1 - Create depression table"
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tables = soup.find_all('table')"
"depression_html = read_html_file(DATA_PATH, DEPRESSION_FILENAME)\n",
"depression_soup = BeautifulSoup(depression_html, 'html.parser')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#convert number as string to integer\n",
"#re.sub() returns the substring that match the regrex\n",
"import re\n",
"def process_num(num):\n",
"    return float(re.sub(r'[^\\w\\s.]','',num))\n"
"depression_rates = [] # preparing list to contain the different depression rates\n",
"depression_countries = [] # preparing list to contain the different country names\n",
"\n",
"COUNTRY_POSITION_IN_DEP_TABLE = 0\n",
"RATE_POSITION_IN_DEP_TABLE = 2\n",
"\n",
"def extract_depression_rates(depression_soup: BeautifulSoup) -> pd.DataFrame:\n",
"    \"\"\"Extract depression rates from soup built from Wikipedia depression table\n",
"    \"\"\"\n",
"    \n",
"    # Extract the table from the soup\n",
"    tables = depression_soup.find_all('table')\n",
"    depression_table = tables[0] # ignore the glossary at the end\n",
"    \n",
"    # Loop over rows\n",
"    ## @COMPLETE : extract all the rows\n",
"    # table_rows = ...\n",
"    for table_row in table_rows:\n",
"        ## @COMPLETE : extract all the cells\n",
"        # table_cells = ...\n",
"\n",
"        if len(table_cells) > 1:\n",
"            country = table_cells[COUNTRY_POSITION_IN_DEP_TABLE]\n",
"            depression_countries.append(country.text.strip())\n",
"\n",
"            rate = table_cells[RATE_POSITION_IN_DEP_TABLE]\n",
"            depression_rates.append(round(float(rate.text.strip())))\n",
"    return pd.DataFrame(depression_rates, index= depression_countries, columns = ['DALY rate'])\n",
"\n",
"df_depression = extract_depression_rates(depression_soup)\n",
"print(f'Extracted depression data for {df_depression.shape[0]} countries')\n",
"display(df_depression.head())"
]
},
{
"cell_type": "code",
"execution_count": 7,
"cell_type": "markdown",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'1156.30'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"num1 = re.sub(r'[^\\w\\s.]','','1,156.30')\n",
"num1"
"## 2 - Create sunshine table"
]
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Rank</th>\n",
" <th>DALY rate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>United States</td>\n",
" <td>1</td>\n",
" <td>1454.74</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Nepal</td>\n",
" <td>2</td>\n",
" <td>1424.48</td>\n",
" </tr>\n",
" <tr>\n",
" <td>East Timor</td>\n",
" <td>3</td>\n",
" <td>1404.10</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Bangladesh</td>\n",
" <td>4</td>\n",
" <td>1401.53</td>\n",
" </tr>\n",
" <tr>\n",
" <td>India</td>\n",
" <td>5</td>\n",
" <td>1400.84</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Pakistan</td>\n",
" <td>6</td>\n",
" <td>1400.42</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Brazil</td>\n",
" <td>7</td>\n",
" <td>1396.10</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Maldives</td>\n",
" <td>8</td>\n",
" <td>1391.61</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Bhutan</td>\n",
" <td>9</td>\n",
" <td>1385.53</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Afghanistan</td>\n",
" <td>10</td>\n",
" <td>1385.14</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Rank DALY rate\n",
"United States 1 1454.74\n",
"Nepal 2 1424.48\n",
"East Timor 3 1404.10\n",
"Bangladesh 4 1401.53\n",
"India 5 1400.84\n",
"Pakistan 6 1400.42\n",
"Brazil 7 1396.10\n",
"Maldives 8 1391.61\n",
"Bhutan 9 1385.53\n",
"Afghanistan 10 1385.14"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"outputs": [],
"source": [
"ranks = []\n",
"rates = []\n",
"countries = []\n",
"links = []\n",
"\n",
"for table in tables:\n",
"    rows = table.find_all('tr')\n",
"    \n",
"    for row in rows:\n",
"        cells = row.find_all('td')\n",
"        \n",
"        if len(cells) > 1:\n",
"            rank = cells[0]\n",
"            ranks.append(int(rank.text))\n",
"            \n",
"            country = cells[1]\n",
"            countries.append(country.text.strip())\n",
"            \n",
"            rate = cells[2]\n",
"            rates.append(process_num(rate.text.strip()))\n",
"            \n",
"            link = cells[1].find('a').get('href')\n",
"            links.append('https://en.wikipedia.org/'+ link)\n",
"            \n",
"df1 = pd.DataFrame(ranks, index= countries, columns = ['Rank'])\n",
"df1['DALY rate'] = rates\n",
"\n",
"df1.head(10)"
"sunshine_html = read_html_file(DATA_PATH, SUNSHINE_FILENAME)\n",
"sunshine_soup = BeautifulSoup(sunshine_html, 'html.parser')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"789.14 3\n",
"Country: Benin, Sunshine Hours: 263.05\n",
"515.99 2\n",
"Country: Togo, Sunshine Hours: 258.0\n",
"710.25 3\n",
"Country: Ghana, Sunshine Hours: 236.75\n",
"866.0500000000001 4\n",
"Country: Cameroon, Sunshine Hours: 216.51\n",
"344.03999999999996 2\n",
"Country: Gabon, Sunshine Hours: 172.02\n",
"1334.54 5\n",
"Country: Nigeria, Sunshine Hours: 266.91\n",
"711.91 2\n",
"Country: Sudan, Sunshine Hours: 355.95\n",
"336.1 1\n",
"Country: Eritrea, Sunshine Hours: 336.1\n",
"641.8 2\n",
"Country: Burkina Faso, Sunshine Hours: 320.9\n",
"320.32 1\n",
"Country: Niger, Sunshine Hours: 320.32\n",
"670.6400000000001 2\n",
"Country: Chad, Sunshine Hours: 335.32\n",
"307.0 1\n",
"Country: Gambia, Sunshine Hours: 307.0\n",
"629.2 2\n",
"Country: Senegal, Sunshine Hours: 314.6\n",
"620.5999999999999 2\n",
"Country: Somalia, Sunshine Hours: 310.3\n",
"327.9 1\n",
"Country: Djibouti, Sunshine Hours: 327.9\n",
"964.0099999999999 3\n",
"Country: Mali, Sunshine Hours: 321.34\n",
"653.3 2\n",
"Country: Algeria, Sunshine Hours: 326.65\n",
"609.99 2\n",
"Country: Tunisia, Sunshine Hours: 305.0\n",
"946.64 3\n",
"Country: Morocco, Sunshine Hours: 315.55\n",
"2253.8500000000004 6\n",
"Country: Egypt, Sunshine Hours: 375.64\n",
"635.6199999999999 2\n",
"Country: Libya, Sunshine Hours: 317.81\n",
"1212.01 4\n",
"Country: Kenya, Sunshine Hours: 303.0\n",
"234.1 1\n",
"Country: Angola, Sunshine Hours: 234.1\n",
"1213.1399999999999 4\n",
"Country: Tanzania, Sunshine Hours: 303.28\n",
"556.97 2\n",
"Country: Ethiopia, Sunshine Hours: 278.49\n",
"666.5 2\n",
"Country: Mauritania, Sunshine Hours: 333.25\n",
"1884.79 6\n",
"Country: South Africa, Sunshine Hours: 314.13\n",
"1028.0 3\n",
"Country: Botswana, Sunshine Hours: 342.67\n",
"889.6400000000001 3\n",
"Country: Zambia, Sunshine Hours: 296.55\n",
"613.08 2\n",
"Country: Zimbabwe, Sunshine Hours: 306.54\n",
"838.76 3\n",
"Country: Malawi, Sunshine Hours: 279.59\n",
"1718.66 6\n",
"Country: Madagascar, Sunshine Hours: 286.44\n",
"283.8 1\n",
"Country: Mozambique, Sunshine Hours: 283.8\n",
"681.8 3\n",
"Country: Uganda, Sunshine Hours: 227.27\n",
"237.34 1\n",
"Country: Burundi, Sunshine Hours: 237.34\n",
"488.0 2\n",
"Country: Guinea, Sunshine Hours: 244.0\n",
"270.7 1\n",
"Country: Guinea-Bissau, Sunshine Hours: 270.7\n",
"309.79 2\n",
"Country: Equatorial Guinea, Sunshine Hours: 154.9\n",
"747.5 2\n",
"Country: Namibia, Sunshine Hours: 373.75\n",
"317.51 1\n",
"Country: Afghanistan, Sunshine Hours: 317.51\n",
"220.74 1\n",
"Country: Azerbaijan, Sunshine Hours: 220.74\n",
"206.6 1\n",
"Country: Bangladesh, Sunshine Hours: 206.6\n",
"1091.49 5\n",
"Country: China, Sunshine Hours: 218.3\n",
"973.66 4\n",
"Country: India, Sunshine Hours: 243.41\n",
"298.33000000000004 1\n",
"Country: Indonesia, Sunshine Hours: 298.33\n",
"282.61 1\n",
"Country: Iran, Sunshine Hours: 282.61\n",
"324.08000000000004 1\n",
"Country: Iraq, Sunshine Hours: 324.08\n",
"331.1 1\n",
"Country: Israel, Sunshine Hours: 331.1\n",
"361.71000000000004 2\n",
"Country: Japan, Sunshine Hours: 180.86\n",
"486.29999999999995 2\n",
"Country: Kazakhstan, Sunshine Hours: 243.15\n",
"279.15 1\n",
"Country: Mongolia, Sunshine Hours: 279.15\n",
"249.2 1\n",
"Country: North Korea, Sunshine Hours: 249.2\n",
"349.33000000000004 1\n",
"Country: Oman, Sunshine Hours: 349.33\n",
"598.4300000000001 2\n",
"Country: Pakistan, Sunshine Hours: 299.22\n",
"210.31 1\n",
"Country: Philippines, Sunshine Hours: 210.31\n",
"1578.2299999999998 8\n",
"Country: Russia, Sunshine Hours: 197.28\n",
"647.3 2\n",
"Country: Saudi Arabia, Sunshine Hours: 323.65\n",
"202.24 1\n",
"Country: Singapore, Sunshine Hours: 202.24\n",
"439.33000000000004 2\n",
"Country: South Korea, Sunshine Hours: 219.67\n",
"870.0099999999999 4\n",
"Country: Thailand, Sunshine Hours: 217.5\n",
"466.76 2\n",
"Country: Turkey, Sunshine Hours: 233.38\n",
"282.39 1\n",
"Country: Uzbekistan, Sunshine Hours: 282.39\n",
"849.4 4\n",
"Country: Vietnam, Sunshine Hours: 212.35\n",
"254.4 1\n",
"Country: Albania, Sunshine Hours: 254.4\n",
"247.4 1\n",
"Country: Armenia, Sunshine Hours: 247.4\n",
"188.4 1\n",
"Country: Austria, Sunshine Hours: 188.4\n",
"180.7 1\n",
"Country: Belarus, Sunshine Hours: 180.7\n",
"154.6 1\n",
"Country: Belgium, Sunshine Hours: 154.6\n",
"176.9 1\n",
"Country: Bosnia and Herzegovina, Sunshine Hours: 176.9\n",
"217.7 1\n",
"Country: Bulgaria, Sunshine Hours: 217.7\n",
"191.3 1\n",
"Country: Croatia, Sunshine Hours: 191.3\n",
"166.8 1\n",
"Country: Czech Republic, Sunshine Hours: 166.8\n",
"331.40999999999997 1\n",
"Country: Cyprus, Sunshine Hours: 331.41\n",
"173.9 1\n",
"Country: Denmark, Sunshine Hours: 173.9\n",
"182.6 1\n",
"Country: Estonia, Sunshine Hours: 182.6\n",
"185.8 1\n",
"Country: Finland, Sunshine Hours: 185.8\n",
"449.8 2\n",
"Country: France, Sunshine Hours: 224.9\n",
"204.6 1\n",
"Country: Georgia, Sunshine Hours: 204.6\n",
"328.79999999999995 2\n",
"Country: Germany, Sunshine Hours: 164.4\n",
"595.0 2\n",
"Country: Greece, Sunshine Hours: 297.5\n",
"198.8 1\n",
"Country: Hungary, Sunshine Hours: 198.8\n",
"132.6 1\n",
"Country: Iceland, Sunshine Hours: 132.6\n",
"145.3 1\n",
"Country: Ireland, Sunshine Hours: 145.3\n",
"438.8 2\n",
"Country: Italy, Sunshine Hours: 219.4\n",
"175.4 1\n",
"Country: Latvia, Sunshine Hours: 175.4\n",
"169.1 1\n",
"Country: Lithuania, Sunshine Hours: 169.1\n",
"305.4 1\n",
"Country: Malta, Sunshine Hours: 305.4\n",
"212.6 1\n",
"Country: Moldova, Sunshine Hours: 212.6\n",
"166.2 1\n",
"Country: Netherlands, Sunshine Hours: 166.2\n",
"166.8 1\n",
"Country: Norway, Sunshine Hours: 166.8\n",
"157.1 1\n",
"Country: Poland, Sunshine Hours: 157.1\n",
"280.6 1\n",
"Country: Portugal, Sunshine Hours: 280.6\n",
"211.5 1\n",
"Country: Romania, Sunshine Hours: 211.5\n",
"203.8 1\n",
"Country: Slovakia, Sunshine Hours: 203.8\n",
"197.4 1\n",
"Country: Slovenia, Sunshine Hours: 197.4\n",
"826.6 3\n",
"Country: Spain, Sunshine Hours: 275.53\n",
"374.29999999999995 2\n",
"Country: Sweden, Sunshine Hours: 187.15\n",
"156.6 1\n",
"Country: Switzerland, Sunshine Hours: 156.6\n",
"195.5 1\n",
"Country: Ukraine, Sunshine Hours: 195.5\n",
"306.0 2\n",
"Country: United Kingdom, Sunshine Hours: 153.0\n",
"1825.24 9\n",
"Country: Canada, Sunshine Hours: 202.8\n",
"225.98000000000002 1\n",
"Country: Honduras, Sunshine Hours: 225.98\n",
"1038.5 4\n",
"Country: Mexico, Sunshine Hours: 259.62\n",
"275.99 1\n",
"Country: Nicaragua, Sunshine Hours: 275.99\n",
"174.35 1\n",
"Country: Panama, Sunshine Hours: 174.35\n",
"295.7 1\n",
"Country: El Salvador, Sunshine Hours: 295.7\n",
"15218.579999999998 54\n",
"Country: United States, Sunshine Hours: 281.83\n",
"1149.52 5\n",
"Country: Argentina, Sunshine Hours: 229.9\n",
"228.89000000000001 1\n",
"Country: Bolivia, Sunshine Hours: 228.89\n",
"1322.58 6\n",
"Country: Brazil, Sunshine Hours: 220.43\n",
"953.81 6\n",
"Country: Colombia, Sunshine Hours: 158.97\n",
"1324.27 5\n",
"Country: Chile, Sunshine Hours: 264.85\n",
"381.90999999999997 2\n",
"Country: Ecuador, Sunshine Hours: 190.95\n",
"280.3 1\n",
"Country: Paraguay, Sunshine Hours: 280.3\n",
"604.0 3\n",
"Country: Peru, Sunshine Hours: 201.33\n",
"248.14000000000001 1\n",
"Country: Uruguay, Sunshine Hours: 248.14\n",
"579.0899999999999 2\n",
"Country: Venezuela, Sunshine Hours: 289.54\n",
"2553.15 9\n",
"Country: Australia, Sunshine Hours: 283.68\n",
"192.2 1\n",
"Country: Fiji, Sunshine Hours: 192.2\n",
"613.1999999999999 3\n",
"Country: New Zealand, Sunshine Hours: 204.4\n",
"246.3 1\n",
"Country: Papua New Guinea, Sunshine Hours: 246.3\n",
"233.0 1\n",
"Country: Solomon Islands, Sunshine Hours: 233.0\n"
]
}
],
"outputs": [],
"source": [
"sun_url = urlopen('https://en.wikipedia.org/wiki/List_of_cities_by_sunshine_duration')\n",
"sun = BeautifulSoup(sun_url, 'html.parser')\n",
"tables = sun.find_all('table')\n",
"\n",
"#Dictionary to hold the name of the country and its corresponding temperature\n",
"country_suns = {}\n",
"country_sunshine = {}\n",
"\n",
"COUNTRY_POSITION_IN_SUN_TABLE = 0\n",
"SUNSHINE_POSITION_IN_SUN_TABLE = -2\n",
"\n",
"#Dictionary to hold the country and its frequency in the table\n",
"count = {}\n",
"for table in tables:\n",
"    if len(table) >1:\n",
"        rows = table.find_all('tr')\n",
"        \n",
"        #Skip the first row, which is the name of the columns\n",
"        for row in rows[1:]:\n",
"            cells = row.find_all('td')\n",
"            country = cells[0].text.strip()\n",
"def extract_monthly_sunshine_hours(sunshine_soup: BeautifulSoup) -> pd.DataFrame:\n",
"    \"\"\"Extract average monthly sunshine hours from soup built from Wikipedia sunshine table\n",
"    \"\"\"\n",
"    sunshine_tables = sunshine_soup.find_all('table')\n",
"    \n",
"    # Loop over tables\n",
"    for table in sunshine_tables:\n",
"        if len(table) >1:\n",
"            \n",
"            #If country in the list of country we found previously\n",
"            #append the country to the dictionary\n",
"            if country in countries:\n",
"            # Loop over rows\n",
"            ## @COMPLETE : extract all the rows\n",
"            # table_rows = ...\n",
"            for table_row in table_rows[1:]: # skip the first row (header)\n",
"                ## @COMPLETE : extract all the cells\n",
"                # table_cells = ...\n",
"                \n",
"                sun = cells[-2].text.strip()\n",
"                sun = process_num(sun)/10\n",
"                \n",
"                #If country is already in the dictionary\n",
"                #add to the existing sun hours of that country and the count to keep track of how many times we add\n",
"                if country in country_suns:\n",
"                    count[country] += 1\n",
"                    country_suns[country] += sun\n",
"                \n",
"                # Extract country and sunshine hours\n",
"                country = table_cells[COUNTRY_POSITION_IN_SUN_TABLE].text.strip()\n",
"                yearly_sun_hours = table_cells[SUNSHINE_POSITION_IN_SUN_TABLE].text.strip()\n",
"                yearly_sun_hours = process_num(yearly_sun_hours)\n",
"                monthly_sun_hours = yearly_sun_hours/12\n",
"\n",
"                # Record hours for every city in the country\n",
"                if country in country_sunshine:\n",
"                    country_sunshine[country].append(monthly_sun_hours)\n",
"                else:\n",
"                    count[country] = 1\n",
"                    country_suns[country] = sun\n",
"                    country_sunshine[country] = [monthly_sun_hours]\n",
"                \n",
"\n",
"#Find the average temperature of each country\n",
"for country in country_suns:\n",
"    print(country_suns[country],count[country])\n",
"    country_suns[country] = round(country_suns[country]/count[country],2)\n",
"    print('Country: {}, Sunshine Hours: {}'.format(country,country_suns[country]))\n",
"    "
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Index: 192 entries, United States to Japan\n",
"Data columns (total 3 columns):\n",
"Rank 192 non-null int64\n",
"DALY rate 192 non-null float64\n",
"Sunshine Hours/Year 122 non-null float64\n",
"dtypes: float64(2), int64(1)\n",
"memory usage: 11.0+ KB\n"
]
}
],
"source": [
"df2 = pd.DataFrame.from_dict(country_suns,orient='index', columns = ['Sunshine Hours/Year'])\n",
"\n",
"df = df1.join(df2)\n",
"    # Finally, take the average sunshine hours over each country\n",
"    for country in country_sunshine:\n",
"        country_sunshine[country] = round(np.average(country_sunshine[country]))\n",
"    \n",
"    return pd.DataFrame.from_dict(country_sunshine, orient='index', columns = ['Sunshine Hours/Month'])\n",
"\n",
"df.info()\n"
"df_sunshine = extract_monthly_sunshine_hours(sunshine_soup)\n",
"print(f'Extracted sunshine data for {df_sunshine.shape[0]} countries')\n",
"display(df_sunshine.head())"
]
},
{
"cell_type": "code",
"execution_count": 21,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"df.dropna(inplace=True)"
"## 3 - Compare depression to sunshine\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Index: 122 entries, United States to Japan\n",
"Data columns (total 3 columns):\n",
"Rank 122 non-null int64\n",
"DALY rate 122 non-null float64\n",
"Sunshine Hours/Year 122 non-null float64\n",
"dtypes: float64(2), int64(1)\n",
"memory usage: 8.8+ KB\n"
]
},
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x1a1a728410>"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"outputs": [],
"source": [
"df.info()\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"sns.scatterplot('Rank', 'Sunshine Hours/Year', data=df)"
"df_joined = df_depression.join(df_sunshine)\n",
"df_joined = df_joined[~df_joined.isnull().any(axis=1)]\n",
"print(f'Having both depression and sunshine information for {df_joined.shape[0]} countries')\n",
"display(df_joined.head())"
]
},
{
"cell_type": "code",
"execution_count": 36,
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Rank</th>\n",
" <th>DALY rate</th>\n",
" <th>Sunshine Hours/Year</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>Rank</td>\n",
" <td>1.000000</td>\n",
" <td>-0.963597</td>\n",
" <td>0.346623</td>\n",
" </tr>\n",
" <tr>\n",
" <td>DALY rate</td>\n",
" <td>-0.963597</td>\n",
" <td>1.000000</td>\n",
" <td>-0.285906</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Sunshine Hours/Year</td>\n",
" <td>0.346623</td>\n",
" <td>-0.285906</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Rank DALY rate Sunshine Hours/Year\n",
"Rank 1.000000 -0.963597 0.346623\n",
"DALY rate -0.963597 1.000000 -0.285906\n",
"Sunshine Hours/Year 0.346623 -0.285906 1.000000"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"outputs": [],
"source": [
"df.corr()"
"correlation = df_joined.corr().iloc[0,1]\n",
"sns.scatterplot(\n",
"    data=df_joined,\n",
"    x='DALY rate',\n",
"    y='Sunshine Hours/Month'\n",
").set_title(f'Pearson correlation : {correlation: 5.2f}');"
]
},
{
...
...
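The last cell of the new notebook joins the two tables on country, drops the rows missing either value, and reports a Pearson correlation. As a rough illustration of that logic, here is a minimal sketch using only the standard library instead of pandas. The country names and numbers are hypothetical, and `inner_join` and `pearson` are illustrative helpers, not functions from the notebook.

```python
from math import sqrt

# Hypothetical per-country values; NOT taken from the notebook's data
depression = {'Freedonia': 1400, 'Sylvania': 1200, 'Borduria': 1000, 'Aland': 900}
sunshine = {'Freedonia': 150, 'Sylvania': 200, 'Borduria': 250}


def inner_join(left: dict, right: dict) -> dict:
    """Keep only the keys present in both tables (what join + dropping NaN rows achieves)."""
    return {key: (left[key], right[key]) for key in left if key in right}


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


joined = inner_join(depression, sunshine)
rates = [pair[0] for pair in joined.values()]
hours = [pair[1] for pair in joined.values()]
correlation = pearson(rates, hours)
print(f'Pearson correlation : {correlation: 5.2f}')
```

In this toy data 'Aland' has no sunshine value, so it is dropped by the join, just as countries without sunshine data are dropped in the notebook.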
%% Cell type:markdown id: tags:
# Exercise 1 - Parsing HTML
The following notebook is greatly inspired by https://github.com/khuyentran1401/Web-Scrapping-Wikipedia
The following notebook is inspired by https://github.com/khuyentran1401/Web-Scrapping-Wikipedia
%% Cell type:markdown id: tags:
### Packages , Paths and Functions
%% Cell type:code id: tags:
``` python
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import re
from pathlib import Path
import numpy as np
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup
import seaborn as sns
DATA_REPO =
```
DATA_PATH = Path('../data/')
%% Cell type:code id: tags:
# Epidemiology webpage : https://en.wikipedia.org/wiki/Epidemiology_of_depression
DEPRESSION_FILENAME = 'a1_epidemiology_of_depression.html' # Stored locally
``` python
url = 'https://en.wikipedia.org/wiki/Epidemiology_of_depression'
```
# Sunshine webpage : https://en.wikipedia.org/wiki/List_of_cities_by_sunshine_duration
SUNSHINE_FILENAME = 'a1_city_sunshine_duration.html' # Stored locally
%% Cell type:code id: tags:
def read_html_file(path: Path, filename: str) -> str:
    """Read an HTML stored locally"""
    with open(path / filename, "r") as file:
        return file.read()
``` python
html = urlopen(url)
```
def process_num(string_number : str) -> float:
    """Convert a string number formatted with a comma to separate thousands
%% Cell type:code id: tags:
``` python
soup = BeautifulSoup(html, 'html.parser')
    Example : 1,823.0 -> 1823.0"""
    return float(re.sub(r'[^\w\s.]','', string_number))
```
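To make the behaviour of `process_num` concrete, here is a minimal, self-contained sketch of the same regex-based cleanup. The sample inputs are illustrative, apart from `'1,156.30'`, which appears in the removed test cell of this diff.

```python
import re


def process_num(string_number: str) -> float:
    """Drop every character that is not a word character, whitespace or a dot, then parse as float."""
    return float(re.sub(r'[^\w\s.]', '', string_number))


thousands = process_num('1,823.0')
rate = process_num('1,156.30')
negative = process_num('-5.0')  # the minus sign is stripped as well
print(thousands, rate, negative)
```

One caveat of this pattern: since `-` is neither a word character, whitespace, nor a dot, a leading minus sign is removed too. That is harmless for rates and sunshine hours, but worth knowing before reusing the helper on signed values.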
%% Cell type:code id: tags:
%% Cell type:markdown id: tags:
``` python
tables = soup.find_all('table')
```
## 1 - Create depression table
%% Cell type:code id: tags:
``` python
#convert number as string to integer
#re.sub() returns the substring that match the regrex
import re
def process_num(num):
    return float(re.sub(r'[^\w\s.]','',num))
depression_html = read_html_file(DATA_PATH, DEPRESSION_FILENAME)
depression_soup = BeautifulSoup(depression_html, 'html.parser')
```
%% Cell type:code id: tags:
``` python
num1 = re.sub(r'[^\w\s.]','','1,156.30')
num1
```
depression_rates = [] # preparing list to contain the different depression rates
depression_countries = [] # preparing list to contain the different country names
%% Output
COUNTRY_POSITION_IN_DEP_TABLE = 0
RATE_POSITION_IN_DEP_TABLE = 2
'1156.30'
def extract_depression_rates(depression_soup: BeautifulSoup) -> pd.DataFrame:
    """Extract depression rates from soup built from Wikipedia depression table
    """
%% Cell type:code id: tags:
    # Extract the table from the soup
    tables = depression_soup.find_all('table')
    depression_table = tables[0] # ignore the glossary at the end
``` python
ranks = []
rates = []
countries = []
links = []
    # Loop over rows
    ## @COMPLETE : extract all the rows
    # table_rows = ...
    for table_row in table_rows:
        ## @COMPLETE : extract all the cells
        # table_cells = ...
for table in tables:
    rows = table.find_all('tr')
        if len(table_cells) > 1:
            country = table_cells[COUNTRY_POSITION_IN_DEP_TABLE]
            depression_countries.append(country.text.strip())
    for row in rows:
        cells = row.find_all('td')
            rate = table_cells[RATE_POSITION_IN_DEP_TABLE]
            depression_rates.append(round(float(rate.text.strip())))
    return pd.DataFrame(depression_rates, index= depression_countries, columns = ['DALY rate'])
        if len(cells) > 1:
            rank = cells[0]
            ranks.append(int(rank.text))
            country = cells[1]
            countries.append(country.text.strip())
df_depression = extract_depression_rates(depression_soup)
print(f'Extracted depression data for {df_depression.shape[0]} countries')
display(df_depression.head())
```
            rate = cells[2]
            rates.append(process_num(rate.text.strip()))
%% Cell type:markdown id: tags:
            link = cells[1].find('a').get('href')
            links.append('https://en.wikipedia.org/'+ link)
## 2 - Create sunshine table
df1 = pd.DataFrame(ranks, index= countries, columns = ['Rank'])
df1['DALY rate'] = rates
%% Cell type:code id: tags:
df1.head(10)
``` python
sunshine_html = read_html_file(DATA_PATH, SUNSHINE_FILENAME)
sunshine_soup = BeautifulSoup(sunshine_html, 'html.parser')
```
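Once a soup has been built, both extraction functions walk the table row by row and then cell by cell. The sketch below shows one possible way to do this with BeautifulSoup's `find_all`, mirroring the calls used in the removed version of the notebook. The inline HTML, the country names, and the `extract_rows` helper are hypothetical stand-ins for the locally stored Wikipedia pages, not part of the commit.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the locally stored Wikipedia page
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Rank</th><th>Rate</th></tr>
  <tr><td>Aland</td><td>1</td><td>1454.74</td></tr>
  <tr><td>Borduria</td><td>2</td><td>1424.48</td></tr>
</table>
"""

COUNTRY_POSITION = 0  # illustrative column positions for the sample table
RATE_POSITION = 2


def extract_rows(html: str) -> list:
    """Return (country, rate) pairs from the first table found in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find_all('table')[0]
    records = []
    for table_row in table.find_all('tr'):       # every row of the table
        table_cells = table_row.find_all('td')   # every data cell of the row
        if len(table_cells) > 1:                 # header rows only hold <th> cells
            country = table_cells[COUNTRY_POSITION].text.strip()
            rate = float(table_cells[RATE_POSITION].text.strip())
            records.append((country, rate))
    return records


records = extract_rows(SAMPLE_HTML)
print(records)
```

The `len(table_cells) > 1` guard is what skips the header row here: it contains only `<th>` elements, so `find_all('td')` returns an empty list for it.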
%% Output
Rank DALY rate
United States 1 1454.74
Nepal 2 1424.48
East Timor 3 1404.10
Bangladesh 4 1401.53
India 5 1400.84
Pakistan 6 1400.42
Brazil 7 1396.10
Maldives 8 1391.61
Bhutan 9 1385.53
Afghanistan 10 1385.14
%% Cell type:code id: tags:
``` python
sun_url = urlopen('https://en.wikipedia.org/wiki/List_of_cities_by_sunshine_duration')
sun = BeautifulSoup(sun_url, 'html.parser')
tables = sun.find_all('table')
#Dictionary to hold the name of the country and its corresponding temperature
country_suns = {}
country_sunshine = {}
#Dictionary to hold the country and its frequency in the table
count = {}
for table in tables:
    if len(table) >1:
        rows = table.find_all('tr')
        #Skip the first row, which is the name of the columns
        for row in rows[1:]:
            cells = row.find_all('td')
            country = cells[0].text.strip()
            #If country in the list of country we found previously
            #append the country to the dictionary
            if country in countries:
                sun = cells[-2].text.strip()
                sun = process_num(sun)/10
                #If country is already in the dictionary
                #add to the existing sun hours of that country and the count to keep track of how many times we add
                if country in country_suns:
                    count[country] += 1
                    country_suns[country] += sun
COUNTRY_POSITION_IN_SUN_TABLE = 0
SUNSHINE_POSITION_IN_SUN_TABLE = -2
def extract_monthly_sunshine_hours(sunshine_soup: BeautifulSoup) -> pd.DataFrame:
    """Extract average monthly sunshine hours from soup built from Wikipedia sunshine table
    """
    sunshine_tables = sunshine_soup.find_all('table')
    # Loop over tables
    for table in sunshine_tables:
        if len(table) >1:
            # Loop over rows
            ## @COMPLETE : extract all the rows
            # table_rows = ...
            for table_row in table_rows[1:]: # skip the first row (header)
                ## @COMPLETE : extract all the cells
                # table_cells = ...
                # Extract country and sunshine hours
                country = table_cells[COUNTRY_POSITION_IN_SUN_TABLE].text.strip()
                yearly_sun_hours = table_cells[SUNSHINE_POSITION_IN_SUN_TABLE].text.strip()
                yearly_sun_hours = process_num(yearly_sun_hours)
                monthly_sun_hours = yearly_sun_hours/12
                # Record hours for every city in the country
                if country in country_sunshine:
                    country_sunshine[country].append(monthly_sun_hours)
                else:
                    count[country] = 1
                    country_suns[country] = sun
#Find the average temperature of each country
for country in country_suns:
    print(country_suns[country],count[country])
    country_suns[country] = round(country_suns[country]/count[country],2)
    print('Country: {}, Sunshine Hours: {}'.format(country,country_suns[country]))
                    country_sunshine[country] = [monthly_sun_hours]
```
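The new sunshine extraction replaces the running sum and counter of the old code with a list of monthly values per country, averaged at the end. The grouping step can be sketched as follows, assuming rows have already been reduced to (country, yearly hours) pairs. The data is hypothetical, `average_monthly_sunshine` is an illustrative name, and `statistics.mean` with `dict.setdefault` stand in for the notebook's `np.average` and explicit `if`/`else`.

```python
from statistics import mean

# Hypothetical (country, yearly sunshine hours) pairs, one per city
city_rows = [
    ('Freedonia', 2400.0),
    ('Freedonia', 3000.0),
    ('Sylvania', 1800.0),
]


def average_monthly_sunshine(rows: list) -> dict:
    """Group yearly hours by country, convert them to monthly hours, and average over cities."""
    country_sunshine = {}
    for country, yearly_sun_hours in rows:
        monthly_sun_hours = yearly_sun_hours / 12
        # setdefault creates the list on first sight of a country, then appends to it
        country_sunshine.setdefault(country, []).append(monthly_sun_hours)
    return {country: round(mean(hours)) for country, hours in country_sunshine.items()}


monthly = average_monthly_sunshine(city_rows)
print(monthly)
```

Keeping the per-city values in a list, rather than a sum and a separate count, removes the need for the second `count` dictionary that the old code maintained.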
%% Output
789.14 3
Country: Benin, Sunshine Hours: 263.05
515.99 2
Country: Togo, Sunshine Hours: 258.0
710.25 3
Country: Ghana, Sunshine Hours: 236.75
866.0500000000001 4
Country: Cameroon, Sunshine Hours: 216.51
344.03999999999996 2
Country: Gabon, Sunshine Hours: 172.02
1334.54 5
Country: Nigeria, Sunshine Hours: 266.91
711.91 2
Country: Sudan, Sunshine Hours: 355.95
336.1 1
Country: Eritrea, Sunshine Hours: 336.1
641.8 2
Country: Burkina Faso, Sunshine Hours: 320.9
320.32 1
Country: Niger, Sunshine Hours: 320.32
670.6400000000001 2
Country: Chad, Sunshine Hours: 335.32
307.0 1
Country: Gambia, Sunshine Hours: 307.0
629.2 2
Country: Senegal, Sunshine Hours: 314.6
620.5999999999999 2
Country: Somalia, Sunshine Hours: 310.3
327.9 1
Country: Djibouti, Sunshine Hours: 327.9
964.0099999999999 3
Country: Mali, Sunshine Hours: 321.34
653.3 2
Country: Algeria, Sunshine Hours: 326.65
609.99 2
Country: Tunisia, Sunshine Hours: 305.0
946.64 3
Country: Morocco, Sunshine Hours: 315.55
2253.8500000000004 6
Country: Egypt, Sunshine Hours: 375.64
635.6199999999999 2
Country: Libya, Sunshine Hours: 317.81
1212.01 4
Country: Kenya, Sunshine Hours: 303.0
234.1 1
Country: Angola, Sunshine Hours: 234.1
1213.1399999999999 4
Country: Tanzania, Sunshine Hours: 303.28
556.97 2
Country: Ethiopia, Sunshine Hours: 278.49
666.5 2
Country: Mauritania, Sunshine Hours: 333.25
1884.79 6
Country: South Africa, Sunshine Hours: 314.13
1028.0 3
Country: Botswana, Sunshine Hours: 342.67
889.6400000000001 3
Country: Zambia, Sunshine Hours: 296.55
613.08 2
Country: Zimbabwe, Sunshine Hours: 306.54
838.76 3
Country: Malawi, Sunshine Hours: 279.59
1718.66 6
Country: Madagascar, Sunshine Hours: 286.44
283.8 1
Country: Mozambique, Sunshine Hours: 283.8
681.8 3
Country: Uganda, Sunshine Hours: 227.27
237.34 1
Country: Burundi, Sunshine Hours: 237.34
488.0 2
Country: Guinea, Sunshine Hours: 244.0
270.7 1
Country: Guinea-Bissau, Sunshine Hours: 270.7
309.79 2
Country: Equatorial Guinea, Sunshine Hours: 154.9
747.5 2
Country: Namibia, Sunshine Hours: 373.75
317.51 1
Country: Afghanistan, Sunshine Hours: 317.51
220.74 1
Country: Azerbaijan, Sunshine Hours: 220.74
206.6 1
Country: Bangladesh, Sunshine Hours: 206.6
1091.49 5
Country: China, Sunshine Hours: 218.3
973.66 4
Country: India, Sunshine Hours: 243.41
298.33000000000004 1
Country: Indonesia, Sunshine Hours: 298.33
282.61 1
Country: Iran, Sunshine Hours: 282.61
324.08000000000004 1
Country: Iraq, Sunshine Hours: 324.08
331.1 1
Country: Israel, Sunshine Hours: 331.1
361.71000000000004 2
Country: Japan, Sunshine Hours: 180.86
486.29999999999995 2
Country: Kazakhstan, Sunshine Hours: 243.15
279.15 1
Country: Mongolia, Sunshine Hours: 279.15
249.2 1
Country: North Korea, Sunshine Hours: 249.2
349.33000000000004 1
Country: Oman, Sunshine Hours: 349.33
598.4300000000001 2
Country: Pakistan, Sunshine Hours: 299.22
210.31 1
Country: Philippines, Sunshine Hours: 210.31
1578.2299999999998 8
Country: Russia, Sunshine Hours: 197.28
647.3 2
Country: Saudi Arabia, Sunshine Hours: 323.65
202.24 1
Country: Singapore, Sunshine Hours: 202.24
439.33000000000004 2
Country: South Korea, Sunshine Hours: 219.67
870.0099999999999 4
Country: Thailand, Sunshine Hours: 217.5
466.76 2
Country: Turkey, Sunshine Hours: 233.38
282.39 1
Country: Uzbekistan, Sunshine Hours: 282.39
849.4 4
Country: Vietnam, Sunshine Hours: 212.35
254.4 1
Country: Albania, Sunshine Hours: 254.4
247.4 1
Country: Armenia, Sunshine Hours: 247.4
188.4 1
Country: Austria, Sunshine Hours: 188.4
180.7 1
Country: Belarus, Sunshine Hours: 180.7
154.6 1
Country: Belgium, Sunshine Hours: 154.6
176.9 1
Country: Bosnia and Herzegovina, Sunshine Hours: 176.9
217.7 1
Country: Bulgaria, Sunshine Hours: 217.7
191.3 1
Country: Croatia, Sunshine Hours: 191.3
166.8 1
Country: Czech Republic, Sunshine Hours: 166.8
331.40999999999997 1
Country: Cyprus, Sunshine Hours: 331.41
173.9 1
Country: Denmark, Sunshine Hours: 173.9
182.6 1
Country: Estonia, Sunshine Hours: 182.6
185.8 1
Country: Finland, Sunshine Hours: 185.8
449.8 2
Country: France, Sunshine Hours: 224.9
204.6 1
Country: Georgia, Sunshine Hours: 204.6
328.79999999999995 2
Country: Germany, Sunshine Hours: 164.4
595.0 2
Country: Greece, Sunshine Hours: 297.5
198.8 1
Country: Hungary, Sunshine Hours: 198.8
132.6 1
Country: Iceland, Sunshine Hours: 132.6
145.3 1
Country: Ireland, Sunshine Hours: 145.3
438.8 2
Country: Italy, Sunshine Hours: 219.4
175.4 1
Country: Latvia, Sunshine Hours: 175.4
169.1 1
Country: Lithuania, Sunshine Hours: 169.1
305.4 1
Country: Malta, Sunshine Hours: 305.4
212.6 1
Country: Moldova, Sunshine Hours: 212.6
166.2 1
Country: Netherlands, Sunshine Hours: 166.2
166.8 1
Country: Norway, Sunshine Hours: 166.8
157.1 1
Country: Poland, Sunshine Hours: 157.1
280.6 1
Country: Portugal, Sunshine Hours: 280.6
211.5 1
Country: Romania, Sunshine Hours: 211.5
203.8 1
Country: Slovakia, Sunshine Hours: 203.8
197.4 1
Country: Slovenia, Sunshine Hours: 197.4
826.6 3
Country: Spain, Sunshine Hours: 275.53
374.29999999999995 2
Country: Sweden, Sunshine Hours: 187.15
156.6 1
Country: Switzerland, Sunshine Hours: 156.6
195.5 1
Country: Ukraine, Sunshine Hours: 195.5
306.0 2
Country: United Kingdom, Sunshine Hours: 153.0
1825.24 9
Country: Canada, Sunshine Hours: 202.8
225.98000000000002 1
Country: Honduras, Sunshine Hours: 225.98
1038.5 4
Country: Mexico, Sunshine Hours: 259.62
275.99 1
Country: Nicaragua, Sunshine Hours: 275.99
174.35 1
Country: Panama, Sunshine Hours: 174.35
295.7 1
Country: El Salvador, Sunshine Hours: 295.7
15218.579999999998 54
Country: United States, Sunshine Hours: 281.83
1149.52 5
Country: Argentina, Sunshine Hours: 229.9
228.89000000000001 1
Country: Bolivia, Sunshine Hours: 228.89
1322.58 6
Country: Brazil, Sunshine Hours: 220.43
953.81 6
Country: Colombia, Sunshine Hours: 158.97
1324.27 5
Country: Chile, Sunshine Hours: 264.85
381.90999999999997 2
Country: Ecuador, Sunshine Hours: 190.95
280.3 1
Country: Paraguay, Sunshine Hours: 280.3
604.0 3
Country: Peru, Sunshine Hours: 201.33
248.14000000000001 1
Country: Uruguay, Sunshine Hours: 248.14
579.0899999999999 2
Country: Venezuela, Sunshine Hours: 289.54
2553.15 9
Country: Australia, Sunshine Hours: 283.68
192.2 1
Country: Fiji, Sunshine Hours: 192.2
613.1999999999999 3
Country: New Zealand, Sunshine Hours: 204.4
246.3 1
Country: Papua New Guinea, Sunshine Hours: 246.3
233.0 1
Country: Solomon Islands, Sunshine Hours: 233.0
    # Finally, take the average sunshine hours over each country
    for country in country_sunshine:
        country_sunshine[country] = round(np.average(country_sunshine[country]))
%% Cell type:code id: tags:
    return pd.DataFrame.from_dict(country_sunshine, orient='index', columns=['Sunshine Hours/Month'])
```python
df2 = pd.DataFrame.from_dict(country_suns, orient='index', columns=['Sunshine Hours/Year'])
df = df1.join(df2)
df.info()
df_sunshine = extract_monthly_sunshine_hours(sunshine_soup)
print(f'Extracted sunshine data for {df_sunshine.shape[0]} countries')
display(df_sunshine.head())
```
%% Output
<class 'pandas.core.frame.DataFrame'>
Index: 192 entries, United States to Japan
Data columns (total 3 columns):
Rank 192 non-null int64
DALY rate 192 non-null float64
Sunshine Hours/Year 122 non-null float64
dtypes: float64(2), int64(1)
memory usage: 11.0+ KB
%% Cell type:markdown id: tags:
%% Cell type:code id: tags:
```python
df.dropna(inplace=True)
```
## 3 - Compare depression to sunshine
%% Cell type:code id: tags:
```python
df.info()
import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot('Rank', 'Sunshine Hours/Year', data=df)
df_joined = df_depression.join(df_sunshine)
df_joined = df_joined[~df_joined.isnull().any(axis=1)]
print(f'Having both depression and sunshine information for {df_joined.shape[0]} countries')
display(df_joined.head())
```
%% Output
<class 'pandas.core.frame.DataFrame'>
Index: 122 entries, United States to Japan
Data columns (total 3 columns):
Rank 122 non-null int64
DALY rate 122 non-null float64
Sunshine Hours/Year 122 non-null float64
dtypes: float64(2), int64(1)
memory usage: 8.8+ KB
<matplotlib.axes._subplots.AxesSubplot at 0x1a1a728410>
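The join in this cell keeps every country of the left-hand frame and leaves NaN where the sunshine table has no entry, which is why the `info()` output goes from 192 rows (122 with sunshine data) to 122 rows once incomplete rows are dropped. A minimal sketch of that behaviour, assuming toy frames named `df_left` and `df_right` with invented values (these names and numbers are illustrative, not the scraped data):

```python
import pandas as pd

# Toy stand-ins for the scraped tables; all values are invented for illustration.
df_left = pd.DataFrame({'DALY rate': [10.0, 20.0, 30.0]}, index=['A', 'B', 'C'])
df_right = pd.DataFrame({'Sunshine Hours/Month': [250.0, 180.0]}, index=['A', 'C'])

# DataFrame.join aligns on the index and defaults to a left join,
# so country 'B' is kept with NaN in the sunshine column.
joined = df_left.join(df_right)

# Keep only rows without any missing value; same result as joined.dropna().
complete = joined[~joined.isnull().any(axis=1)]

print(joined)
print(complete)
```

The boolean mask and `dropna()` give the same rows here; the mask form only makes the filtering condition explicit.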
%% Cell type:code id: tags:
```python
df.corr()
correlation = df_joined.corr().iloc[0, 1]
sns.scatterplot(data=df_joined, x='DALY rate', y='Sunshine Hours/Month').set_title(f'Pearson correlation : {correlation:5.2f}');
```
%% Output
                         Rank  DALY rate  Sunshine Hours/Year
Rank                 1.000000  -0.963597             0.346623
DALY rate           -0.963597   1.000000            -0.285906
Sunshine Hours/Year  0.346623  -0.285906             1.000000
%% Cell type:code id: tags:
```python
```
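The overall pattern used in this exercise (collect several city values per country, average them, join with the depression table, then read off the Pearson coefficient) can be sketched end to end on a few rows. This is a rough illustration under stated assumptions: the `rows` list, the `df_rate` frame and every number in it are invented, and the coefficient is looked up by column label instead of the positional `.iloc[0, 1]` used in the notebook.

```python
import numpy as np
import pandas as pd

# Invented city-level rows (country, monthly sunshine hours); not the scraped data.
rows = [('A', 200.0), ('A', 300.0), ('B', 150.0), ('C', 100.0), ('C', 120.0)]

# Collect every city value under its country, then average each list.
per_country = {}
for country, hours in rows:
    per_country.setdefault(country, []).append(hours)
averages = {country: float(np.average(values)) for country, values in per_country.items()}

df_sun = pd.DataFrame.from_dict(averages, orient='index', columns=['Sunshine Hours/Month'])
df_rate = pd.DataFrame({'DALY rate': [5.0, 7.0, 9.0]}, index=['A', 'B', 'C'])
df_both = df_rate.join(df_sun)

# Selecting the coefficient by label does not depend on the column order,
# unlike a positional lookup such as .iloc[0, 1].
correlation = df_both.corr().loc['DALY rate', 'Sunshine Hours/Month']
print(f'Pearson correlation: {correlation:5.2f}')
```

A label-based lookup stays correct if extra numeric columns (for example a rank column) are later added to the joined frame.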