{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data\n", "\n", "Working with data is fairly easy in Python. The `pandas` module makes it easy to manipulate large datasets. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dictionaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we have two lists. One contains country names, " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "countries = ['Canada','United States','Germany','France','Italy']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The other contains the population of each country, in millions:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "pop = [38.068,332.915,83.9,64.426,60.367]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We would like to work with these data, for example to get a country's population by calling its name. A pandas dataset, that is, a dataframe, is essentially an object built around a Python data type, the dictionary. Let's see what a dictionary is by building one from the two lists, paired with `zip`: " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "map_pop = dict(zip(countries,pop))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, suppose we want the population of Germany. 
We only need to look it up by its key: " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "83.9" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "map_pop['Germany']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A dictionary is made up of keys and values; each key-value pair is an item. Let's take a look:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['Canada', 'United States', 'Germany', 'France', 'Italy'])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "map_pop.keys()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_values([38.068, 332.915, 83.9, 64.426, 60.367])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "map_pop.values()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_items([('Canada', 38.068), ('United States', 332.915), ('Germany', 83.9), ('France', 64.426), ('Italy', 60.367)])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "map_pop.items()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also write the dictionary directly as a literal:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "map_pop2 = {'Canada':38.068,'United States':332.915,'Germany':83.9,'France':64.426,'Italy':60.367}" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "83.9" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "map_pop2['Germany']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A dataframe is built around a dictionary of lists (or arrays). For example, " ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame({'country':countries,'population':pop})" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>country</th><th>population</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>Canada</td><td>38.068</td></tr>\n", "<tr><th>1</th><td>United States</td><td>332.915</td></tr>\n", "<tr><th>2</th><td>Germany</td><td>83.900</td></tr>\n", "<tr><th>3</th><td>France</td><td>64.426</td></tr>\n", "<tr><th>4</th><td>Italy</td><td>60.367</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ " country population\n", "0 Canada 38.068\n", "1 United States 332.915\n", "2 Germany 83.900\n", "3 France 64.426\n", "4 Italy 60.367" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also create a dataframe by passing the index, the columns, and the data explicitly:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "df2 = pd.DataFrame(index=countries,columns=['population'],data=pop)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>population</th></tr></thead>\n", "<tbody>\n", "<tr><th>Canada</th><td>38.068</td></tr>\n", "<tr><th>United States</th><td>332.915</td></tr>\n", "<tr><th>Germany</th><td>83.900</td></tr>\n", "<tr><th>France</th><td>64.426</td></tr>\n", "<tr><th>Italy</th><td>60.367</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ " population\n", "Canada 38.068\n", "United States 332.915\n", "Germany 83.900\n", "France 64.426\n", "Italy 60.367" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a difference between the two dataframes. The first has two columns, country and population. It also has a column in bold that starts at zero and appears to give the observation number. The second does not have the country variable; instead, its bold column holds the country names. This bold column is the index. The advantage of the index is that it lets me retrieve the value for a country more easily than if I had used the country column. " ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "83.9" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc['Germany','population']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is only one element, so the result is a scalar. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose I want two countries, " ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Germany 83.900\n", "Italy 60.367\n", "Name: population, dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc[['Germany','Italy'],'population']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is not a scalar but what pandas calls a `Series`, since it is a single column." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I can sort on my index. " ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>population</th></tr></thead>\n", "<tbody>\n", "<tr><th>Canada</th><td>38.068</td></tr>\n", "<tr><th>France</th><td>64.426</td></tr>\n", "<tr><th>Germany</th><td>83.900</td></tr>\n", "<tr><th>Italy</th><td>60.367</td></tr>\n", "<tr><th>United States</th><td>332.915</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ " population\n", "Canada 38.068\n", "France 64.426\n", "Germany 83.900\n", "Italy 60.367\n", "United States 332.915" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I can add a variable, the GDP of each country, in billions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "gdp = [1711.39,20494.1,4000.39,2775.25,2072.2]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "df2['gdp'] = gdp" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>population</th><th>gdp</th></tr></thead>\n", "<tbody>\n", "<tr><th>Canada</th><td>38.068</td><td>1711.39</td></tr>\n", "<tr><th>United States</th><td>332.915</td><td>20494.10</td></tr>\n", "<tr><th>Germany</th><td>83.900</td><td>4000.39</td></tr>\n", "<tr><th>France</th><td>64.426</td><td>2775.25</td></tr>\n", "<tr><th>Italy</th><td>60.367</td><td>2072.20</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ " population gdp\n", "Canada 38.068 1711.39\n", "United States 332.915 20494.10\n", "Germany 83.900 4000.39\n", "France 64.426 2775.25\n", "Italy 60.367 2072.20" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pandas provides a whole range of statistics" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "population 115.9352\n", "gdp 6210.6660\n", "dtype: float64" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.mean()" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "population 579.676\n", "gdp 31053.330\n", "dtype: float64" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.sum()" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>population</th><th>gdp</th></tr></thead>\n", "<tbody>\n", "<tr><th>count</th><td>5.000000</td><td>5.000000</td></tr>\n", "<tr><th>mean</th><td>115.935200</td><td>6210.666000</td></tr>\n", "<tr><th>std</th><td>122.383425</td><td>8032.345163</td></tr>\n", "<tr><th>min</th><td>38.068000</td><td>1711.390000</td></tr>\n", "<tr><th>25%</th><td>60.367000</td><td>2072.200000</td></tr>\n", "<tr><th>50%</th><td>64.426000</td><td>2775.250000</td></tr>\n", "<tr><th>75%</th><td>83.900000</td><td>4000.390000</td></tr>\n", "<tr><th>max</th><td>332.915000</td><td>20494.100000</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ " population gdp\n", "count 5.000000 5.000000\n", "mean 115.935200 6210.666000\n", "std 122.383425 8032.345163\n", "min 38.068000 1711.390000\n", "25% 60.367000 2072.200000\n", "50% 64.426000 2775.250000\n", "75% 83.900000 4000.390000\n", "max 332.915000 20494.100000" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can transpose the last table" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>count</th><th>mean</th><th>std</th><th>min</th><th>25%</th><th>50%</th><th>75%</th><th>max</th></tr></thead>\n", "<tbody>\n", "<tr><th>population</th><td>5.0</td><td>115.9352</td><td>122.383425</td><td>38.068</td><td>60.367</td><td>64.426</td><td>83.90</td><td>332.915</td></tr>\n", "<tr><th>gdp</th><td>5.0</td><td>6210.6660</td><td>8032.345163</td><td>1711.390</td><td>2072.200</td><td>2775.250</td><td>4000.39</td><td>20494.100</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ " count mean std min 25% 50% \\\n", "population 5.0 115.9352 122.383425 38.068 60.367 64.426 \n", "gdp 5.0 6210.6660 8032.345163 1711.390 2072.200 2775.250 \n", "\n", " 75% max \n", "population 83.90 332.915 \n", "gdp 4000.39 20494.100 " ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.describe().transpose()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Column calculations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The column names are stored in the `columns` attribute " ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['population', 'gdp'], dtype='object')" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's compute GDP per capita. GDP is in billions and population in millions, so the ratio is multiplied by 1e3 to express it per person" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "df2['gdp_per_cap'] = df2['gdp']*1e3/df2['population']" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>population</th><th>gdp</th><th>gdp_per_cap</th></tr></thead>\n", "<tbody>\n", "<tr><th>Canada</th><td>38.068</td><td>1711.39</td><td>44956.131134</td></tr>\n", "<tr><th>United States</th><td>332.915</td><td>20494.10</td><td>61559.557244</td></tr>\n", "<tr><th>Germany</th><td>83.900</td><td>4000.39</td><td>47680.452920</td></tr>\n", "<tr><th>France</th><td>64.426</td><td>2775.25</td><td>43076.552944</td></tr>\n", "<tr><th>Italy</th><td>60.367</td><td>2072.20</td><td>34326.701675</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ " population gdp gdp_per_cap\n", "Canada 38.068 1711.39 44956.131134\n", "United States 332.915 20494.10 61559.557244\n", "Germany 83.900 4000.39 47680.452920\n", "France 64.426 2775.25 43076.552944\n", "Italy 60.367 2072.20 34326.701675" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can rank the countries by GDP per capita" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>population</th><th>gdp</th><th>gdp_per_cap</th></tr></thead>\n", "<tbody>\n", "<tr><th>United States</th><td>332.915</td><td>20494.10</td><td>61559.557244</td></tr>\n", "<tr><th>Germany</th><td>83.900</td><td>4000.39</td><td>47680.452920</td></tr>\n", "<tr><th>Canada</th><td>38.068</td><td>1711.39</td><td>44956.131134</td></tr>\n", "<tr><th>France</th><td>64.426</td><td>2775.25</td><td>43076.552944</td></tr>\n", "<tr><th>Italy</th><td>60.367</td><td>2072.20</td><td>34326.701675</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ " population gdp gdp_per_cap\n", "United States 332.915 20494.10 61559.557244\n", "Germany 83.900 4000.39 47680.452920\n", "Canada 38.068 1711.39 44956.131134\n", "France 64.426 2775.25 43076.552944\n", "Italy 60.367 2072.20 34326.701675" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.sort_values(by='gdp_per_cap',ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Columns can be referenced with two notations" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Canada 44956.131134\n", "United States 61559.557244\n", "Germany 47680.452920\n", "France 43076.552944\n", "Italy 34326.701675\n", "Name: gdp_per_cap, dtype: float64" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.gdp_per_cap" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Canada 44956.131134\n", "United States 61559.557244\n", "Germany 47680.452920\n", "France 43076.552944\n", "Italy 34326.701675\n", "Name: gdp_per_cap, dtype: float64" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2['gdp_per_cap']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A new column must be created with the latter (bracket) notation; the dot notation only works for columns that already exist. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Merge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we have another dataset that contains the continent of each country" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "df3 = pd.DataFrame(index=countries,columns=['continent'],data=['North America','North America','Europe','Europe','Europe'])" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>continent</th></tr></thead>\n", "<tbody>\n", "<tr><th>Canada</th><td>North America</td></tr>\n", "<tr><th>United States</th><td>North America</td></tr>\n", "<tr><th>Germany</th><td>Europe</td></tr>\n", "<tr><th>France</th><td>Europe</td></tr>\n", "<tr><th>Italy</th><td>Europe</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ " continent\n", "Canada North America\n", "United States North America\n", "Germany Europe\n", "France Europe\n", "Italy Europe" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can merge these data into our existing dataframe, matching rows on the index" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "df4 = df2.merge(df3,left_index=True,right_index=True)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>population</th><th>gdp</th><th>gdp_per_cap</th><th>continent</th></tr></thead>\n", "<tbody>\n", "<tr><th>Canada</th><td>38.068</td><td>1711.39</td><td>44956.131134</td><td>North America</td></tr>\n", "<tr><th>United States</th><td>332.915</td><td>20494.10</td><td>61559.557244</td><td>North America</td></tr>\n", "<tr><th>Germany</th><td>83.900</td><td>4000.39</td><td>47680.452920</td><td>Europe</td></tr>\n", "<tr><th>France</th><td>64.426</td><td>2775.25</td><td>43076.552944</td><td>Europe</td></tr>\n", "<tr><th>Italy</th><td>60.367</td><td>2072.20</td><td>34326.701675</td><td>Europe</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ " population gdp gdp_per_cap continent\n", "Canada 38.068 1711.39 44956.131134 North America\n", "United States 332.915 20494.10 61559.557244 North America\n", "Germany 83.900 4000.39 47680.452920 Europe\n", "France 64.426 2775.25 43076.552944 Europe\n", "Italy 60.367 2072.20 34326.701675 Europe" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plots\n", "\n", "We can make plots directly from a pandas object!" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df2['gdp_per_cap'].sort_values(ascending=False).plot.bar()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving and exporting\n", "\n", "Several formats are available, including Excel, Stata, and LaTeX. Python's native format, pickle (pkl), is very efficient. " ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "df2.to_excel('countries.xlsx')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading data\n", "\n", "We can load data of several types directly, even from the web. Let's look at a well-known example, a dataset on movies (Kaggle IMDB Scores data)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "url = 'https://dq-blog-files.s3.amazonaws.com/movies.xls'" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "df = pd.read_excel(url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a lot of movies..." ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1338" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the first few movies" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>Title</th><th>Year</th><th>Genres</th><th>Language</th><th>Country</th><th>Content Rating</th><th>Duration</th><th>Aspect Ratio</th><th>Budget</th><th>Gross Earnings</th><th>...</th><th>Facebook Likes - Actor 1</th><th>Facebook Likes - Actor 2</th><th>Facebook Likes - Actor 3</th><th>Facebook Likes - cast Total</th><th>Facebook likes - Movie</th><th>Facenumber in posters</th><th>User Votes</th><th>Reviews by Users</th><th>Reviews by Crtiics</th><th>IMDB Score</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>Intolerance: Love's Struggle Throughout the Ages</td><td>1916</td><td>Drama|History|War</td><td>NaN</td><td>USA</td><td>Not Rated</td><td>123</td><td>1.33</td><td>385907.0</td><td>NaN</td><td>...</td><td>436</td><td>22</td><td>9.0</td><td>481</td><td>691</td><td>1</td><td>10718</td><td>88</td><td>69.0</td><td>8.0</td></tr>\n", "<tr><th>1</th><td>Over the Hill to the Poorhouse</td><td>1920</td><td>Crime|Drama</td><td>NaN</td><td>USA</td><td>NaN</td><td>110</td><td>1.33</td><td>100000.0</td><td>3000000.0</td><td>...</td><td>2</td><td>2</td><td>0.0</td><td>4</td><td>0</td><td>1</td><td>5</td><td>1</td><td>1.0</td><td>4.8</td></tr>\n", "<tr><th>2</th><td>The Big Parade</td><td>1925</td><td>Drama|Romance|War</td><td>NaN</td><td>USA</td><td>Not Rated</td><td>151</td><td>1.33</td><td>245000.0</td><td>NaN</td><td>...</td><td>81</td><td>12</td><td>6.0</td><td>108</td><td>226</td><td>0</td><td>4849</td><td>45</td><td>48.0</td><td>8.3</td></tr>\n", "<tr><th>3</th><td>Metropolis</td><td>1927</td><td>Drama|Sci-Fi</td><td>German</td><td>Germany</td><td>Not Rated</td><td>145</td><td>1.33</td><td>6000000.0</td><td>26435.0</td><td>...</td><td>136</td><td>23</td><td>18.0</td><td>203</td><td>12000</td><td>1</td><td>111841</td><td>413</td><td>260.0</td><td>8.3</td></tr>\n", "<tr><th>4</th><td>Pandora's Box</td><td>1929</td><td>Crime|Drama|Romance</td><td>German</td><td>Germany</td><td>Not Rated</td><td>110</td><td>1.33</td><td>NaN</td><td>9950.0</td><td>...</td><td>426</td><td>20</td><td>3.0</td><td>455</td><td>926</td><td>1</td><td>7431</td><td>84</td><td>71.0</td><td>8.0</td></tr>\n", "</tbody>\n", "</table>\n", "<p>5 rows × 25 columns</p>" ], "text/plain": [ " Title Year \\\n", "0 Intolerance: Love's Struggle Throughout the Ages  1916 \n", "1 Over the Hill to the Poorhouse  1920 \n", "2 The Big Parade  1925 \n", "3 Metropolis  1927 \n", "4 Pandora's Box  1929 \n", "\n", " Genres Language Country Content Rating Duration \\\n", "0 Drama|History|War NaN USA Not Rated 123 \n", "1 Crime|Drama NaN USA NaN 110 \n", "2 Drama|Romance|War NaN USA Not Rated 151 \n", "3 Drama|Sci-Fi German Germany Not Rated 145 \n", "4 Crime|Drama|Romance German Germany Not Rated 110 \n", "\n", " Aspect Ratio Budget Gross Earnings ... Facebook Likes - Actor 1 \\\n", "0 1.33 385907.0 NaN ... 436 \n", "1 1.33 100000.0 3000000.0 ... 2 \n", "2 1.33 245000.0 NaN ... 81 \n", "3 1.33 6000000.0 26435.0 ... 136 \n", "4 1.33 NaN 9950.0 ... 426 \n", "\n", " Facebook Likes - Actor 2 Facebook Likes - Actor 3 \\\n", "0 22 9.0 \n", "1 2 0.0 \n", "2 12 6.0 \n", "3 23 18.0 \n", "4 20 3.0 \n", "\n", " Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters \\\n", "0 481 691 1 \n", "1 4 0 1 \n", "2 108 226 0 \n", "3 203 12000 1 \n", "4 455 926 1 \n", "\n", " User Votes Reviews by Users Reviews by Crtiics IMDB Score \n", "0 10718 88 69.0 8.0 \n", "1 5 1 1.0 4.8 \n", "2 4849 45 48.0 8.3 \n", "3 111841 413 260.0 8.3 \n", "4 7431 84 71.0 8.0 \n", "\n", "[5 rows x 25 columns]" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grouping\n", "\n", "Consider a harder calculation, such as the number of titles per genre for the 10 genres with the most titles. 
C'est là que pandas va faire sa magie à l'aide de la fonction `groupby` " ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Genres\n", "Drama 62\n", "Comedy|Drama 51\n", "Comedy 48\n", "Comedy|Drama|Romance 46\n", "Drama|Romance 39\n", "Action|Adventure|Thriller 28\n", "Crime|Drama 27\n", "Comedy|Romance 23\n", "Crime|Drama|Thriller 20\n", "Comedy|Crime 17\n", "Name: Title, dtype: int64" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('Genres').count()['Title'].sort_values(ascending=False).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regardons par pays, l'écrasante majorité vient des États-Unis" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Country\n", "USA 1073\n", "UK 130\n", "France 26\n", "Canada 25\n", "Australia 18\n", "Germany 12\n", "Italy 11\n", "Japan 9\n", "Spain 4\n", "West Germany 3\n", "Name: Title, dtype: int64" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('Country').count()['Title'].sort_values(ascending=False).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combien en fait? 82% environ..." 
] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [], "source": [ "# Title counts for the 10 most represented countries\n", "co_lead = df.groupby('Country').count()['Title'].sort_values(ascending=False).head(10)" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Country\n", "USA 0.818459\n", "UK 0.099161\n", "France 0.019832\n", "Canada 0.019069\n", "Australia 0.013730\n", "Germany 0.009153\n", "Italy 0.008391\n", "Japan 0.006865\n", "Spain 0.003051\n", "West Germany 0.002288\n", "Name: Title, dtype: float64" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert counts to shares; the denominator is the top-10 total, not the full dataset\n", "co_lead = co_lead/co_lead.sum()\n", "co_lead" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "interpreter": { "hash": "ba2340ab882356406e091df0706039b4b3cc5191eef6c073d3fb97005dbe0324" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 2 }