{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data\n",
"\n",
"Working with data is fairly easy in Python. The `pandas` module makes it straightforward to manipulate large datasets."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dictionaries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Suppose we have two lists. One contains country names:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"countries = ['Canada','United States','Germany','France','Italy']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The other contains the population of each country, in millions:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"pop = [38.068,332.915,83.9,64.426,60.367]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We would like to work with these data, for example to look up the population of a country by its name. A pandas dataset, that is, a dataframe, is an object built around a core Python data type, the dictionary. Let's see what a dictionary is by building one from two paired lists with `zip`:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"map_pop = dict(zip(countries,pop))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now suppose we want the population of Germany. We simply look it up by its key:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"83.9"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"map_pop['Germany']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A dictionary is made up of keys and their associated values; each key-value pair is an item. Let's take a look:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['Canada', 'United States', 'Germany', 'France', 'Italy'])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"map_pop.keys()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_values([38.068, 332.915, 83.9, 64.426, 60.367])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"map_pop.values()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_items([('Canada', 38.068), ('United States', 332.915), ('Germany', 83.9), ('France', 64.426), ('Italy', 60.367)])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"map_pop.items()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A dictionary can also be written out directly as follows:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"map_pop2 = {'Canada':38.068,'United States':332.915,'Germany':83.9,'France':64.426,'Italy':60.367}"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"83.9"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"map_pop2['Germany']"
]
},
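{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beyond square-bracket lookup, a few other dictionary operations are often useful. The sketch below is illustrative only: it rebuilds `map_pop` from the lists above, and the variable names `spain`, `has_canada`, and `largest` are introduced here purely for the example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: the data repeat the lists defined earlier in the notebook.\n",
"countries = ['Canada', 'United States', 'Germany', 'France', 'Italy']\n",
"pop = [38.068, 332.915, 83.9, 64.426, 60.367]\n",
"map_pop = dict(zip(countries, pop))\n",
"\n",
"# .get returns a default value instead of raising an error when the key is missing.\n",
"spain = map_pop.get('Spain', None)\n",
"\n",
"# Membership tests look at the keys.\n",
"has_canada = 'Canada' in map_pop\n",
"\n",
"# Iterating over items yields (key, value) pairs; here we pick the largest value.\n",
"largest = max(map_pop.items(), key=lambda item: item[1])\n",
"\n",
"print(spain, has_canada, largest)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Square brackets on a missing key (`map_pop['Spain']`) would raise a `KeyError`, which is why `.get` is used for the lookup that may fail."
]
},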
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A dataframe is built around a dictionary of lists (or arrays). For example:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame({'country':countries,'population':pop})"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" country population\n",
"0 Canada 38.068\n",
"1 United States 332.915\n",
"2 Germany 83.900\n",
"3 France 64.426\n",
"4 Italy 60.367"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A dataframe can also be created by passing the index, the column names, and the data separately:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"df2 = pd.DataFrame(index=countries,columns=['population'],data=pop)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" population\n",
"Canada 38.068\n",
"United States 332.915\n",
"Germany 83.900\n",
"France 64.426\n",
"Italy 60.367"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is a difference between the two dataframes. The first has two columns, `country` and `population`. It also has a bold column that starts at zero and appears to give the observation number. The second has no `country` variable; instead, the bold column holds the country names. This bold column is the index. The advantage of the index is that it lets me retrieve the value for a country more easily than if I had used the `country` column."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"83.9"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.loc['Germany','population']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is only one element, so the result is a scalar."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Suppose I want two countries:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Germany 83.900\n",
"Italy 60.367\n",
"Name: population, dtype: float64"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.loc[['Germany','Italy'],'population']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is not a scalar but what pandas calls a `Series`, since it is a single column."
]
},
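{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the role of the index concrete, here is a minimal sketch that rebuilds both dataframes from the same lists and compares the two ways of looking up Germany. The names `via_column`, `via_index`, and `two` are introduced for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"countries = ['Canada', 'United States', 'Germany', 'France', 'Italy']\n",
"pop = [38.068, 332.915, 83.9, 64.426, 60.367]\n",
"\n",
"# Default integer index: the country is an ordinary column.\n",
"df = pd.DataFrame({'country': countries, 'population': pop})\n",
"# Countries used as the index.\n",
"df2 = pd.DataFrame(index=countries, columns=['population'], data=pop)\n",
"\n",
"# Without a meaningful index, a boolean mask selects the row; the result is a Series.\n",
"via_column = df.loc[df['country'] == 'Germany', 'population']\n",
"# With the index, one label and one column name give a scalar.\n",
"via_index = df2.loc['Germany', 'population']\n",
"# A list of labels gives a Series.\n",
"two = df2.loc[['Germany', 'Italy'], 'population']\n",
"\n",
"print(type(via_column).__name__, type(via_index).__name__, type(two).__name__)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The column-based lookup returns a one-element `Series` that still has to be unpacked (for example with `.iloc[0]`), whereas the index-based lookup returns the number directly."
]
},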
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I can sort the dataframe by its index:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" population\n",
"Canada 38.068\n",
"France 64.426\n",
"Germany 83.900\n",
"Italy 60.367\n",
"United States 332.915"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.sort_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I can add a variable, the GDP of each country, in billions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df2"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"gdp = [1711.39,20494.1,4000.39,2775.25,2072.2]"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"df2['gdp'] = gdp"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" population gdp\n",
"Canada 38.068 1711.39\n",
"United States 332.915 20494.10\n",
"Germany 83.900 4000.39\n",
"France 64.426 2775.25\n",
"Italy 60.367 2072.20"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Many summary statistics are available in pandas:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"population 115.9352\n",
"gdp 6210.6660\n",
"dtype: float64"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.mean()"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"population 579.676\n",
"gdp 31053.330\n",
"dtype: float64"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.sum()"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" population gdp\n",
"count 5.000000 5.000000\n",
"mean 115.935200 6210.666000\n",
"std 122.383425 8032.345163\n",
"min 38.068000 1711.390000\n",
"25% 60.367000 2072.200000\n",
"50% 64.426000 2775.250000\n",
"75% 83.900000 4000.390000\n",
"max 332.915000 20494.100000"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can transpose the last table:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" count mean std min 25% 50% \\\n",
"population 5.0 115.9352 122.383425 38.068 60.367 64.426 \n",
"gdp 5.0 6210.6660 8032.345163 1711.390 2072.200 2775.250 \n",
"\n",
" 75% max \n",
"population 83.90 332.915 \n",
"gdp 4000.39 20494.100 "
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.describe().transpose()"
]
},
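{
"cell_type": "markdown",
"metadata": {},
"source": [
"`describe` returns a fixed set of statistics. When only a few are needed, `agg` accepts a list of function names. A minimal sketch, rebuilding a copy of the table from the values used above (the name `stats` is introduced here for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"countries = ['Canada', 'United States', 'Germany', 'France', 'Italy']\n",
"pop = [38.068, 332.915, 83.9, 64.426, 60.367]\n",
"gdp = [1711.39, 20494.1, 4000.39, 2775.25, 2072.2]\n",
"df2 = pd.DataFrame({'population': pop, 'gdp': gdp}, index=countries)\n",
"\n",
"# Each requested statistic becomes a row; each column of df2 stays a column.\n",
"stats = df2.agg(['mean', 'median', 'max'])\n",
"print(stats)"
]
},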
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculations on columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The column names are stored in:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['population', 'gdp'], dtype='object')"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.columns"
]
},
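{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's compute GDP per capita. GDP is in billions and population in millions, so the ratio is multiplied by 1e3 to express the result per person:"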
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"df2['gdp_per_cap'] = df2['gdp']*1e3/df2['population']"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" population gdp gdp_per_cap\n",
"Canada 38.068 1711.39 44956.131134\n",
"United States 332.915 20494.10 61559.557244\n",
"Germany 83.900 4000.39 47680.452920\n",
"France 64.426 2775.25 43076.552944\n",
"Italy 60.367 2072.20 34326.701675"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can rank the countries by GDP per capita:"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" population gdp gdp_per_cap\n",
"United States 332.915 20494.10 61559.557244\n",
"Germany 83.900 4000.39 47680.452920\n",
"Canada 38.068 1711.39 44956.131134\n",
"France 64.426 2775.25 43076.552944\n",
"Italy 60.367 2072.20 34326.701675"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.sort_values(by='gdp_per_cap',ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Columns can be referred to using two notations:"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Canada 44956.131134\n",
"United States 61559.557244\n",
"Germany 47680.452920\n",
"France 43076.552944\n",
"Italy 34326.701675\n",
"Name: gdp_per_cap, dtype: float64"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.gdp_per_cap"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Canada 44956.131134\n",
"United States 61559.557244\n",
"Germany 47680.452920\n",
"France 43076.552944\n",
"Italy 34326.701675\n",
"Name: gdp_per_cap, dtype: float64"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2['gdp_per_cap']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A new column must be created using the latter (bracket) notation."
]
},
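{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following sketch illustrates why the bracket notation is needed. It uses a small two-country table, and the attribute name `pop_k` is hypothetical. Assigning to a new attribute name only sets a Python attribute on the object, and pandas issues a `UserWarning` (silenced here) without adding a column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({'gdp': [1711.39, 20494.1], 'population': [38.068, 332.915]},\n",
"                  index=['Canada', 'United States'])\n",
"\n",
"# Bracket notation creates a real column.\n",
"df['gdp_per_cap'] = df['gdp'] * 1e3 / df['population']\n",
"\n",
"# Attribute assignment to a new name does not create a column.\n",
"with warnings.catch_warnings():\n",
"    warnings.simplefilter('ignore')\n",
"    df.pop_k = df['population'] * 1e3\n",
"\n",
"print(list(df.columns))"
]
},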
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Merge"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Suppose another dataset contains the continent of each country:"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"df3 = pd.DataFrame(index=countries,columns=['continent'],data=['North America','North America','Europe','Europe','Europe'])"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" continent\n",
"Canada North America\n",
"United States North America\n",
"Germany Europe\n",
"France Europe\n",
"Italy Europe"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can join these data to our existing dataframe, matching rows on the index:"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
"df4 = df2.merge(df3,left_index=True,right_index=True)"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" population gdp gdp_per_cap continent\n",
"Canada 38.068 1711.39 44956.131134 North America\n",
"United States 332.915 20494.10 61559.557244 North America\n",
"Germany 83.900 4000.39 47680.452920 Europe\n",
"France 64.426 2775.25 43076.552944 Europe\n",
"Italy 60.367 2072.20 34326.701675 Europe"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4"
]
},
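{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, `merge` keeps only the rows whose keys appear in both tables. The sketch below uses two small hypothetical tables whose indexes only partly overlap (Japan and its continent are added purely for illustration) to show the `how` and `indicator` parameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"left = pd.DataFrame({'population': [38.068, 83.9, 60.367]},\n",
"                    index=['Canada', 'Germany', 'Italy'])\n",
"right = pd.DataFrame({'continent': ['North America', 'Europe', 'Asia']},\n",
"                     index=['Canada', 'Germany', 'Japan'])\n",
"\n",
"# Default how='inner': only labels present in both indexes are kept.\n",
"inner = left.merge(right, left_index=True, right_index=True)\n",
"\n",
"# how='left' keeps every row of the left table; unmatched rows get NaN.\n",
"# indicator=True adds a _merge column recording where each row came from.\n",
"left_join = left.merge(right, left_index=True, right_index=True,\n",
"                       how='left', indicator=True)\n",
"\n",
"print(inner)\n",
"print(left_join)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Checking the `_merge` column after a left join is a simple way to spot rows that found no match, which an inner join would drop silently."
]
},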
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plots\n",
"\n",
"We can make plots directly from a pandas object!"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"[Figure: bar chart of gdp_per_cap by country, sorted in descending order]"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"df2['gdp_per_cap'].sort_values(ascending=False).plot.bar()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Saving and exporting\n",
"\n",
"Several formats are available, including Excel, Stata, and LaTeX. Python's native format, which is very efficient, is pickle (pkl)."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"df2.to_excel('countries.xlsx')"
]
},
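{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above writes an Excel file. As a complement, here is a minimal sketch of a round trip through pickle and CSV. It writes to a temporary directory so nothing is left on disk; the file names and the small two-row table are assumptions made for the example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import tempfile\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({'population': [38.068, 83.9]}, index=['Canada', 'Germany'])\n",
"\n",
"with tempfile.TemporaryDirectory() as tmp:\n",
"    pkl_path = os.path.join(tmp, 'countries.pkl')\n",
"    csv_path = os.path.join(tmp, 'countries.csv')\n",
"\n",
"    # Pickle stores the dataframe object itself, index and dtypes included.\n",
"    df.to_pickle(pkl_path)\n",
"    from_pkl = pd.read_pickle(pkl_path)\n",
"\n",
"    # CSV is plain text; index_col=0 reads the first column back as the index.\n",
"    df.to_csv(csv_path)\n",
"    from_csv = pd.read_csv(csv_path, index_col=0)\n",
"\n",
"print(from_pkl.equals(df))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code; CSV is the safer choice for sharing."
]
},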
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading data\n",
"\n",
"Data of several types can be loaded directly, even from the web. Let's look at this well-known example of a film dataset (Kaggle IMDB Scores data):"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"url = 'https://dq-blog-files.s3.amazonaws.com/movies.xls'"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_excel(url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are a lot of films..."
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1338"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the first few films:"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Title Year \\\n",
"0 Intolerance: Love's Struggle Throughout the Ages 1916 \n",
"1 Over the Hill to the Poorhouse 1920 \n",
"2 The Big Parade 1925 \n",
"3 Metropolis 1927 \n",
"4 Pandora's Box 1929 \n",
"\n",
" Genres Language Country Content Rating Duration \\\n",
"0 Drama|History|War NaN USA Not Rated 123 \n",
"1 Crime|Drama NaN USA NaN 110 \n",
"2 Drama|Romance|War NaN USA Not Rated 151 \n",
"3 Drama|Sci-Fi German Germany Not Rated 145 \n",
"4 Crime|Drama|Romance German Germany Not Rated 110 \n",
"\n",
" Aspect Ratio Budget Gross Earnings ... Facebook Likes - Actor 1 \\\n",
"0 1.33 385907.0 NaN ... 436 \n",
"1 1.33 100000.0 3000000.0 ... 2 \n",
"2 1.33 245000.0 NaN ... 81 \n",
"3 1.33 6000000.0 26435.0 ... 136 \n",
"4 1.33 NaN 9950.0 ... 426 \n",
"\n",
" Facebook Likes - Actor 2 Facebook Likes - Actor 3 \\\n",
"0 22 9.0 \n",
"1 2 0.0 \n",
"2 12 6.0 \n",
"3 23 18.0 \n",
"4 20 3.0 \n",
"\n",
" Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters \\\n",
"0 481 691 1 \n",
"1 4 0 1 \n",
"2 108 226 0 \n",
"3 203 12000 1 \n",
"4 455 926 1 \n",
"\n",
" User Votes Reviews by Users Reviews by Crtiics IMDB Score \n",
"0 10718 88 69.0 8.0 \n",
"1 5 1 1.0 4.8 \n",
"2 4849 45 48.0 8.3 \n",
"3 111841 413 260.0 8.3 \n",
"4 7431 84 71.0 8.0 \n",
"\n",
"[5 rows x 25 columns]"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Grouping\n",
"\n",
"Consider a calculation that would be tedious by hand: the number of titles per genre, for the 10 genres with the most titles. This is where pandas works its magic, through the `groupby` method. "
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Genres\n",
"Drama 62\n",
"Comedy|Drama 51\n",
"Comedy 48\n",
"Comedy|Drama|Romance 46\n",
"Drama|Romance 39\n",
"Action|Adventure|Thriller 28\n",
"Crime|Drama 27\n",
"Comedy|Romance 23\n",
"Crime|Drama|Thriller 20\n",
"Comedy|Crime 17\n",
"Name: Title, dtype: int64"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count titles per genre combination and keep the 10 largest groups\n",
"df.groupby('Genres')['Title'].count().sort_values(ascending=False).head(10)"
]
},
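{
"cell_type": "markdown",
"metadata": {},
"source": [
"The count above treats each combination such as `Comedy|Drama` as its own category. As a minimal sketch, assuming we would rather count every individual genre, the `Genres` strings can be split on `|` and expanded with `explode`. The small `toy` dataframe below is hypothetical and only stands in for `df`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for the film dataframe `df`\n",
"toy = pd.DataFrame({\n",
"    'Title': ['A', 'B', 'C', 'D'],\n",
"    'Genres': ['Drama', 'Comedy|Drama', 'Comedy', 'Crime|Drama'],\n",
"})\n",
"\n",
"# Split each string on '|' and give every genre its own row\n",
"single = toy.assign(Genre=toy['Genres'].str.split('|')).explode('Genre')\n",
"\n",
"# Count titles per individual genre, largest first\n",
"genre_counts = single.groupby('Genre')['Title'].count().sort_values(ascending=False)\n",
"genre_counts"
]
},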
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's look by country: the overwhelming majority of films come from the United States."
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Country\n",
"USA 1073\n",
"UK 130\n",
"France 26\n",
"Canada 25\n",
"Australia 18\n",
"Germany 12\n",
"Italy 11\n",
"Japan 9\n",
"Spain 4\n",
"West Germany 3\n",
"Name: Title, dtype: int64"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count titles per country and keep the 10 largest groups\n",
"df.groupby('Country')['Title'].count().sort_values(ascending=False).head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How large is that share exactly? About 82% of the titles from the ten leading countries."
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [],
"source": [
"# Title counts for the 10 countries with the most films\n",
"co_lead = df.groupby('Country')['Title'].count().sort_values(ascending=False).head(10)"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Country\n",
"USA 0.818459\n",
"UK 0.099161\n",
"France 0.019832\n",
"Canada 0.019069\n",
"Australia 0.013730\n",
"Germany 0.009153\n",
"Italy 0.008391\n",
"Japan 0.006865\n",
"Spain 0.003051\n",
"West Germany 0.002288\n",
"Name: Title, dtype: float64"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Convert the counts into shares of the top-10 total\n",
"co_lead = co_lead / co_lead.sum()\n",
"co_lead"
]
},
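{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `co_lead` was truncated with `head(10)` before dividing, the shares above are relative to the ten leading countries only. As a sketch of an alternative, `value_counts(normalize=True)` divides by all rows with a known country instead. The `toy` dataframe below is hypothetical and only stands in for `df`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for the film dataframe `df`\n",
"toy = pd.DataFrame({\n",
"    'Title': ['A', 'B', 'C', 'D', 'E'],\n",
"    'Country': ['USA', 'USA', 'USA', 'UK', 'France'],\n",
"})\n",
"\n",
"# Share of each country among all rows with a known country\n",
"share_all = toy['Country'].value_counts(normalize=True)\n",
"\n",
"# Share among the two leading countries only (same idea as co_lead)\n",
"top = toy['Country'].value_counts().head(2)\n",
"share_top = top / top.sum()\n",
"\n",
"share_all['USA'], share_top['USA']"
]
},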
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"interpreter": {
"hash": "ba2340ab882356406e091df0706039b4b3cc5191eef6c073d3fb97005dbe0324"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}