{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas Profiling: NASA meteorites example\n", "\n", "* Original data source: [https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh](https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh)\n", "* Original example notebook: [https://docs.profiling.ydata.ai/latest/getting-started/examples/](https://docs.profiling.ydata.ai/latest/getting-started/examples/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](img/01.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](img/02.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NASA's Meteorite Landings dataset contains detailed information about meteorites that have been recorded on Earth. The main fields are:\n", "\n", "* name: Name of the meteorite.\n", "* id: Unique identifier for each record.\n", "* nametype: Type of name (for example, \"Valid\" for confirmed meteorites).\n", "* recclass: Classification of the meteorite according to its composition.\n", "* mass (g): Mass of the meteorite in grams.\n", "* fall: Whether the meteorite was observed falling (\"Fell\") or was found later (\"Found\").\n", "* year: Year in which the meteorite fell or was found.\n", "* reclat and reclong: Latitude and longitude of the location where the find was recorded.\n", "* GeoLocation: Latitude and longitude combined into a single text field." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import libraries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[ipywidgets](https://ipywidgets.readthedocs.io/en/stable/) is a Python library for building interactive widgets in Jupyter notebooks." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Channels:\n", " - defaults\n", " - conda-forge\n", "Platform: linux-64\n", "Collecting package metadata (repodata.json): done\n", "Solving environment: done\n", "\n", "## Package Plan ##\n", "\n", " environment location: /home/ir_inf/anaconda3/envs/pandas_profiling\n", "\n", " added / updated specs:\n", " - ipywidgets\n", "\n", "\n", "The following packages will be downloaded:\n", "\n", " package | build\n", " ---------------------------|-----------------\n", " ipywidgets-8.1.2 | py312h06a4308_0 244 KB\n", " jupyterlab_widgets-3.0.10 | py312h06a4308_0 195 KB\n", " widgetsnbextension-4.0.10 | py312h06a4308_0 947 KB\n", " ------------------------------------------------------------\n", " Total: 1.4 MB\n", "\n", "The following NEW packages will be INSTALLED:\n", "\n", " ipywidgets pkgs/main/linux-64::ipywidgets-8.1.2-py312h06a4308_0 \n", " jupyterlab_widgets pkgs/main/linux-64::jupyterlab_widgets-3.0.10-py312h06a4308_0 \n", " widgetsnbextension pkgs/main/linux-64::widgetsnbextension-4.0.10-py312h06a4308_0 \n", "\n", "\n", "\n", "Downloading and Extracting Packages:\n",
"Preparing transaction: done\n", "Verifying transaction: done\n", "Executing transaction: done\n" ] } ], "source": [ "!conda install ipywidgets --yes" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import requests\n", "\n", "import ydata_profiling  # importing this registers DataFrame.profile_report()\n", "from ydata_profiling.utils.cache import cache_file\n", "import ipywidgets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load and prepare a sample dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cells download the data from a URL and load it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cache_file function:\n", "\n", "* Downloads the CSV file from the given URL.\n", "* Saves it locally as meteorites.csv (if the file is already in the cache, it is not downloaded again).\n", "* Returns the local path to the file, which is stored in file_name." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('/home/ir_inf/anaconda3/envs/pandas_profiling/lib/python3.12/data/meteorites.csv')" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file_name = cache_file(\n", "    \"meteorites.csv\",\n", "    \"https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD\",\n", ")\n", "\n", "file_name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the file into a DataFrame:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>name</th><th>id</th><th>nametype</th><th>recclass</th><th>mass (g)</th><th>fall</th><th>year</th><th>reclat</th><th>reclong</th><th>GeoLocation</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Aachen</td><td>1</td><td>Valid</td><td>L5</td><td>21.0</td><td>Fell</td><td>1880.0</td><td>50.77500</td><td>6.08333</td><td>(50.775, 6.08333)</td></tr>\n",
"    <tr><th>1</th><td>Aarhus</td><td>2</td><td>Valid</td><td>H6</td><td>720.0</td><td>Fell</td><td>1951.0</td><td>56.18333</td><td>10.23333</td><td>(56.18333, 10.23333)</td></tr>\n",
"    <tr><th>2</th><td>Abee</td><td>6</td><td>Valid</td><td>EH4</td><td>107000.0</td><td>Fell</td><td>1952.0</td><td>54.21667</td><td>-113.00000</td><td>(54.21667, -113.0)</td></tr>\n",
"    <tr><th>3</th><td>Acapulco</td><td>10</td><td>Valid</td><td>Acapulcoite</td><td>1914.0</td><td>Fell</td><td>1976.0</td><td>16.88333</td><td>-99.90000</td><td>(16.88333, -99.9)</td></tr>\n",
"    <tr><th>4</th><td>Achiras</td><td>370</td><td>Valid</td><td>L6</td><td>780.0</td><td>Fell</td><td>1902.0</td><td>-33.16667</td><td>-64.95000</td><td>(-33.16667, -64.95)</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>45711</th><td>Zillah 002</td><td>31356</td><td>Valid</td><td>Eucrite</td><td>172.0</td><td>Found</td><td>1990.0</td><td>29.03700</td><td>17.01850</td><td>(29.037, 17.0185)</td></tr>\n",
"    <tr><th>45712</th><td>Zinder</td><td>30409</td><td>Valid</td><td>Pallasite, ungrouped</td><td>46.0</td><td>Found</td><td>1999.0</td><td>13.78333</td><td>8.96667</td><td>(13.78333, 8.96667)</td></tr>\n",
"    <tr><th>45713</th><td>Zlin</td><td>30410</td><td>Valid</td><td>H4</td><td>3.3</td><td>Found</td><td>1939.0</td><td>49.25000</td><td>17.66667</td><td>(49.25, 17.66667)</td></tr>\n",
"    <tr><th>45714</th><td>Zubkovsky</td><td>31357</td><td>Valid</td><td>L6</td><td>2167.0</td><td>Found</td><td>2003.0</td><td>49.78917</td><td>41.50460</td><td>(49.78917, 41.5046)</td></tr>\n",
"    <tr><th>45715</th><td>Zulu Queen</td><td>30414</td><td>Valid</td><td>L3.7</td><td>200.0</td><td>Found</td><td>1976.0</td><td>33.98333</td><td>-115.68333</td><td>(33.98333, -115.68333)</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>45716 rows × 10 columns</p>
\n", "
" ], "text/plain": [ " name id nametype recclass mass (g) fall \\\n", "0 Aachen 1 Valid L5 21.0 Fell \n", "1 Aarhus 2 Valid H6 720.0 Fell \n", "2 Abee 6 Valid EH4 107000.0 Fell \n", "3 Acapulco 10 Valid Acapulcoite 1914.0 Fell \n", "4 Achiras 370 Valid L6 780.0 Fell \n", "... ... ... ... ... ... ... \n", "45711 Zillah 002 31356 Valid Eucrite 172.0 Found \n", "45712 Zinder 30409 Valid Pallasite, ungrouped 46.0 Found \n", "45713 Zlin 30410 Valid H4 3.3 Found \n", "45714 Zubkovsky 31357 Valid L6 2167.0 Found \n", "45715 Zulu Queen 30414 Valid L3.7 200.0 Found \n", "\n", " year reclat reclong GeoLocation \n", "0 1880.0 50.77500 6.08333 (50.775, 6.08333) \n", "1 1951.0 56.18333 10.23333 (56.18333, 10.23333) \n", "2 1952.0 54.21667 -113.00000 (54.21667, -113.0) \n", "3 1976.0 16.88333 -99.90000 (16.88333, -99.9) \n", "4 1902.0 -33.16667 -64.95000 (-33.16667, -64.95) \n", "... ... ... ... ... \n", "45711 1990.0 29.03700 17.01850 (29.037, 17.0185) \n", "45712 1999.0 13.78333 8.96667 (13.78333, 8.96667) \n", "45713 1939.0 49.25000 17.66667 (49.25, 17.66667) \n", "45714 2003.0 49.78917 41.50460 (49.78917, 41.5046) \n", "45715 1976.0 33.98333 -115.68333 (33.98333, -115.68333) \n", "\n", "[45716 rows x 10 columns]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(file_name)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**We add some \"fake\" variables** to illustrate what pandas profiling can do." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[pandas.to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas-to-datetime) converts the \"year\" column of the DataFrame df to the datetime type. 
`errors=\"coerce\"` tells pandas to replace any value that cannot be converted to a valid datetime (such as unrecognized text or empty values) with NaT (Not a Time) instead of raising an error." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas._libs.tslibs.timestamps.Timestamp" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#type(df[\"year\"]) # pandas.core.series.Series\n", "#type(df[\"year\"][0]) # numpy.float64\n", "df[\"year\"] = pd.to_datetime(df[\"year\"], format='%Y', errors=\"coerce\")\n", "type(df[\"year\"][0]) # pandas._libs.tslibs.timestamps.Timestamp\n", "#df[\"year\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example: constant variable" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "df[\"source\"] = \"NASA\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adds a new column named \"boolean\" to the DataFrame df, filled with random True or False values generated with numpy. One value is generated per row of the DataFrame (`df.shape[0]`)." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "df[\"boolean\"] = np.random.choice([True, False], df.shape[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creates a new column named \"mixed\" in the DataFrame df, whose values are drawn at random from 1 and \"A\", one per row. Because NumPy builds a single string array from the list `[1, \"A\"]`, the 1 is stored as the string \"1\"." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "df[\"mixed\"] = np.random.choice([1, \"A\"], df.shape[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creates a new column named \"reclat_city\" in the DataFrame df. 
Each value in the \"reclat\" column (the recorded latitude of the meteorite) has a random value added to it, drawn from a normal (Gaussian) distribution. The random values have a standard deviation (scale) of 5 and, since no mean is given, are centered on 0, so about 68% of them fall within ±5. The number of values generated equals the length of the DataFrame (len(df)), so every row receives one. In practice, this simulates geographic noise around the original latitude, which is useful for geospatial analyses that need variability." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "df[\"reclat_city\"] = df[\"reclat\"] + np.random.normal(scale=5, size=(len(df)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example: duplicate observations.\n", "\n", "Selects the first 10 rows of the DataFrame df with `iloc[0:10]` and creates a new DataFrame named duplicates_to_add as a copy of those rows. The code then appends \" copy\" to each name and concatenates the rows onto the end of df." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "duplicates_to_add = df.iloc[0:10].copy()\n", "duplicates_to_add[\"name\"] = duplicates_to_add[\"name\"] + \" copy\"\n", "\n", "df = pd.concat([df, duplicates_to_add], ignore_index=True)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>name</th><th>id</th><th>nametype</th><th>recclass</th><th>mass (g)</th><th>fall</th><th>year</th><th>reclat</th><th>reclong</th><th>GeoLocation</th><th>source</th><th>boolean</th><th>mixed</th><th>reclat_city</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Aachen</td><td>1</td><td>Valid</td><td>L5</td><td>21.0</td><td>Fell</td><td>1880-01-01</td><td>50.77500</td><td>6.08333</td><td>(50.775, 6.08333)</td><td>NASA</td><td>False</td><td>A</td><td>47.245529</td></tr>\n",
"    <tr><th>1</th><td>Aarhus</td><td>2</td><td>Valid</td><td>H6</td><td>720.0</td><td>Fell</td><td>1951-01-01</td><td>56.18333</td><td>10.23333</td><td>(56.18333, 10.23333)</td><td>NASA</td><td>True</td><td>A</td><td>49.824794</td></tr>\n",
"    <tr><th>2</th><td>Abee</td><td>6</td><td>Valid</td><td>EH4</td><td>107000.0</td><td>Fell</td><td>1952-01-01</td><td>54.21667</td><td>-113.00000</td><td>(54.21667, -113.0)</td><td>NASA</td><td>True</td><td>1</td><td>52.399247</td></tr>\n",
"  </tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " name id nametype recclass mass (g) fall year reclat \\\n", "0 Aachen 1 Valid L5 21.0 Fell 1880-01-01 50.77500 \n", "1 Aarhus 2 Valid H6 720.0 Fell 1951-01-01 56.18333 \n", "2 Abee 6 Valid EH4 107000.0 Fell 1952-01-01 54.21667 \n", "\n", " reclong GeoLocation source boolean mixed reclat_city \n", "0 6.08333 (50.775, 6.08333) NASA False A 47.245529 \n", "1 10.23333 (56.18333, 10.23333) NASA True A 49.824794 \n", "2 -113.00000 (54.21667, -113.0) NASA True 1 52.399247 " ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inline report without saving the object" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Summarize dataset: 100%|██████████████████████████████████| 49/49 [00:04<00:00, 10.09it/s, Completed]\n", "Generate report structure: 100%|███████████████████████████████████████| 1/1 [00:04<00:00, 4.80s/it]\n", "Render HTML: 100%|█████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.52it/s]\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "report = df.profile_report()\n", "report" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save the report to a file" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Export report to file: 100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 82.80it/s]\n" ] } ], "source": [ "report.to_file(\"meteorites.html\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { 
"name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 4 }