{ "cells": [ { "cell_type": "markdown", "id": "61ea077b", "metadata": { "id": "61ea077b" }, "source": [ "# Exploring TC PRIMED, Chapter 3: Pixel-Based Neural Network Application\n", "- Creators: Naufal Razin, Kathy Haynes, and Chris Slocum\n", "- Affiliation: CIRA and NESDIS/STAR\n", "\n", "---\n", "\n", "## Overview\n", "TC PRIMED contains passive microwave observations in their original swath structure. One of the ways in which users can apply a machine learning method to TC PRIMED swath data is by training a machine learning model for individual pixels. In this notebook, we will explore one way to pre-process TC PRIMED passive microwave data and apply a pixel-based machine learning method.\n", "\n", "## Prerequisites\n", "To successfully navigate and use this notebook, you should be familiar with:\n", "- the basics of Python programming such as loading modules, assigning variables, and list/array indexing\n", "- NetCDF files and NetCDF groups (Chapter 1a of this Learning Journey)\n", "- plotting data using `matplotlib`\n", "- TC PRIMED overpass file data structure (Chapter 1b of this Learning Journey)\n", "- basic performance metrics like mean squared error and mean absolute error\n", "\n", "## Learning Outcomes\n", "By working through this notebook, you will learn how to:\n", "- develop a pre-processing step for TC PRIMED passive microwave data for a pixel-based machine learning application\n", "- apply a pixel-based machine learning method to TC PRIMED passive microwave data" ] }, { "cell_type": "markdown", "id": "c97852f4-fdae-4722-9f29-37c112189477", "metadata": { "id": "a4c87fdb" }, "source": [ "## Background\n", "The Tropical Cyclone Precipitation, Infrared, Microwave, and Environmental Dataset (TC PRIMED) is a dataset centered around satellite passive microwave observations of tropical cyclones. TC PRIMED contains data from consistent sources, making it suitable for machine learning applications. For example, the passive microwave brightness temperature data in TC PRIMED has been corrected for any instrument drift and bias such that the observed brightness temperatures are consistent throughout the multiple years each satellite has been in orbit (Berg et al. 2016). Therefore, the differences in observed brightness temperatures are mainly due to differences in the characteristics of the different sensors in TC PRIMED.\n", "\n", "When developing any machine learning model, your dataset would usually consist of input feature(s) and target label(s). The input feature(s) is the input you would use to make your prediction and the target label(s) is what you want to predict. Since TC PRIMED stores passive microwave brightness temperature data in its original swath format, users can take one of two approaches to apply a machine learning model to the TC PRIMED swath data:\n", "- process the swath data (like brightness temperature) into a more uniform data structure and use that as the input feature to predict a storm-scale phenomenon (the target label, e.g., is there a hurricane eye in the passive microwave image?)\n", "- make pixel-based predictions by using the brightness temperature values at each pixel as the input feature to predict the target label within the same pixel (e.g., is there convective or stratiform precipitation in the passive microwave pixel?)\n", "\n", "In this notebook, we will go through an example of the second approach. And, in contrast to the *classification* examples provided above, we will employ a machine learning *regression*. 
In particular, we will use passive microwave brightness temperature observations from the Global Precipitation Measurement mission (GPM) satellite's Microwave Imager (GMI) sensor as our input feature to predict/regress the target label, which we have chosen to be the surface precipitation rate, at each pixel. Using GMI observations simplifies our example since:\n", "- it makes observations using a conical-scanning sensor, which ensures a constant resolution across the swath\n", "- its observations in the 10.65 GHz through 89.0 GHz frequencies are located on the same swath\n", "\n", "For our surface precipitation rate, we will use the NASA Goddard PROFiling algorithm (GPROF; Kummerow et al. 2015), also included in TC PRIMED. GPROF contains a surface precipitation rate variable that is available on the same swath as the passive microwave observations in the 10.65 through 89.0 GHz frequencies. The GPROF surface precipitation rate variable is in a unit of mm/hour.\n", "\n", "To recap, we will use the GPM passive microwave observations as our input features to predict/regress the surface precipitation rate at each pixel using GPROF as our target label. Note, however, GPROF uses a Bayesian approach, ancillary data such as reanalysis fields, and the passive microwave brightness temperature observations to obtain a precipitation profile within the passive microwave pixel. Therefore, **GPROF is not completely independent from the passive microwave observations.** We are simply using GPROF in this example to avoid additional layers of data pre-processing. Users can use their own independent dataset for a pixel-based prediction by regridding the dataset (e.g., observation from the GPM dual-frequency precipitation radar) onto the passive microwave swath. Figure 1 shows an example of the passive microwave brightness temperature observations and GPROF surface precipitation rate from Hurricane Irma (2017)." ] }, { "cell_type": "markdown", "id": "cbe439f0-9781-4f8d-b12c-f14d2eaded64", "metadata": { "id": "a4c87fdb" }, "source": [ "
\n", "\"Hurricane\n", "\n", "
Figure 1. Example of 36.64 GHz and 89.0 GHz brightness temperatures with GPROF surface precipitation rates from a GPM observation of Hurricane Irma (2017).
\n", " \n", "
" ] }, { "cell_type": "markdown", "id": "231d72d8", "metadata": { "id": "231d72d8" }, "source": [ "## Software\n", "This tutorial uses the Python programming language and packages. We will use:\n", "- `netCDF4` to read in the TC PRIMED file\n", "- `numpy` for simple array operations\n", "- `global_land_mask` to determine if a point is over the ocean\n", "- `tensorflow` for artificial neural network operations\n", "- `matplotlib` to plot the data" ] }, { "cell_type": "markdown", "id": "e73087dd", "metadata": { "id": "e73087dd" }, "source": [ "### Install Packages\n", "Let's first check if we have the necessary Python packages to run this notebook. If we don't, let's install them." ] }, { "cell_type": "code", "execution_count": null, "id": "yFrTx7GSQPt8", "metadata": { "id": "yFrTx7GSQPt8" }, "outputs": [], "source": [ "import subprocess, sys\n", "packages = [\"netCDF4\", \"global_land_mask\", \"matplotlib\",\n", " \"numpy\", \"tensorflow\"]\n", "for package in packages:\n", " try:\n", " __import__(package)\n", " except ImportError:\n", " subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])" ] }, { "cell_type": "markdown", "id": "9A1KpT1vQhO2", "metadata": { "id": "9A1KpT1vQhO2" }, "source": [ "Now, let's load the modules in the packages (e.g., `Dataset`) or load the packages and assign a shorter object name for the packages (e.g., `import numpy as np`) for a cleaner use throughout the notebook." ] }, { "cell_type": "code", "execution_count": null, "id": "e5d3acb7", "metadata": { "id": "e5d3acb7" }, "outputs": [], "source": [ "from netCDF4 import Dataset\n", "import numpy as np\n", "from global_land_mask import globe\n", "import tensorflow as tf\n", "import tensorflow.keras as keras\n", "import tensorflow.keras.layers as layers\n", "import matplotlib.pyplot as plt\n", "plt.rcParams['pcolor.shading'] = 'auto'" ] }, { "cell_type": "markdown", "id": "6779a4f9", "metadata": { "id": "6779a4f9" }, "source": [ "## Splitting Data Into Training, Validation, and Testing Sets\n", "As we have mentioned in the [background](#Background), your dataset would usually consist of input feature(s) and target label(s). To develop a good machine learning model, this pair of input feature(s) and target label(s) should also be separated into [training, validation, and testing sets](https://mlu-explain.github.io/train-test-validation/). In essence:\n", "- The training set is the dataset you would use to train your model.\n", "- The validation set is the dataset you would use to validate the performance of your trained model. By assessing the performance of the trained model on your validation set, you could decide to tweak the model hyperparameters to try to increase the model skill and/or reduce effects of underfitting or overfitting.\n", "- The testing set is the dataset used to evaluate the final model. The testing set should be completely independent from the training and validation process. **Only evaluate the model on the testing set as a final step, once you are happy with your model performance based on the validation set.**\n", "\n", "Typically, the training set encompasses 60% - 80% of your overall data, while the validation set encompasses 20% - 30% of your data. Equally important, the training, validation, and testing sets should not be from the same sample. 
In other words, if you want your model to predict tropical cyclone intensification and you use the intensity data from Hurricane Michael to train your model, neither your validation nor testing sets should have any intensity data from Hurricane Michael.\n", "\n", "With these criteria in mind, let's obtain the training, validation, and testing sets from TC PRIMED. When developing a machine learning model, we typically would want to maximize the number of samples that we use. However, for this notebook, we will use only a subset of TC PRIMED data for computing efficiency. Therefore,\n", "- for the training set, we will use a select few GPM observations from Hurricane Jose (2017).\n", "- for the validation set, we will use a select few GPM observations from Hurricane Maria (2017).\n", "- for the testing set, we will use a select few GPM observations from Hurricane Irma (2017)." ] }, { "cell_type": "markdown", "id": "lBu7dk2IdwN0", "metadata": { "id": "lBu7dk2IdwN0" }, "source": [ "## Read Files Online\n", "Let's retrieve information from the TC PRIMED files that we will use in this example. The TC PRIMED files will be from GMI overpasses of Hurricanes Jose, Maria, and Irma (2017). We will use the Python `netCDF4` and `requests` packages to read and retrieve the information directly from the TC PRIMED files available on an Amazon Web Service S3 bucket as part of the NOAA Open Data Dissemination program (NODD), without downloading the files, and store the information from each file in a netCDF `Dataset` instance, collected in the dictionaries `TRAIN_DS`, `VAL_DS`, and `TEST_DS` with the file names as keys." ] }, { "cell_type": "code", "execution_count": null, "id": "2cd15866", "metadata": { "id": "2cd15866" }, "outputs": [], "source": [ "import requests\n", "\n", "# Specify the URL to the Jose TC PRIMED folder on NODD\n", "TRAIN_NODD_URL = \"https://noaa-nesdis-tcprimed-pds.s3.amazonaws.com/v01r00/final/2017/AL/12/\"\n", "\n", "# Specify the names of the files we will use from the Jose TC PRIMED folder on NODD\n", "TRAIN_FILE_NAMES = [\n", " \"TCPRIMED_v01r00-final_AL122017_GMI_GPM_020015_20170906043506.nc\",\n", " \"TCPRIMED_v01r00-final_AL122017_GMI_GPM_020053_20170908155108.nc\",\n", " \"TCPRIMED_v01r00-final_AL122017_GMI_GPM_020062_20170909050809.nc\",\n", " \"TCPRIMED_v01r00-final_AL122017_GMI_GPM_020084_20170910153710.nc\",\n", " \"TCPRIMED_v01r00-final_AL122017_GMI_GPM_020093_20170911045711.nc\",\n", " \"TCPRIMED_v01r00-final_AL122017_GMI_GPM_020124_20170913044913.nc\",\n", " \"TCPRIMED_v01r00-final_AL122017_GMI_GPM_020130_20170913143513.nc\",\n", " \"TCPRIMED_v01r00-final_AL122017_GMI_GPM_020139_20170914035614.nc\",\n", " \"TCPRIMED_v01r00-final_AL122017_GMI_GPM_020170_20170916034716.nc\"\n", " ]\n", "\n", "# Declare TRAIN_DS as a dictionary\n", "TRAIN_DS = {}\n", "\n", "# Loop through each train file name\n", "for train_file_name in TRAIN_FILE_NAMES:\n", "\n", " # Join TRAIN_NODD_URL and train_file_name to produce a complete link\n", " # Retrieve the contents of the TC PRIMED file from the complete link\n", " url_response = requests.get(TRAIN_NODD_URL + train_file_name)\n", "\n", " # Load the contents of the TC PRIMED file in an \"instance\" type into the\n", " # dictionary TRAIN_DS with the file name as the dictionary key\n", " TRAIN_DS[train_file_name] = Dataset(train_file_name, memory=url_response.content)" ] }, { "cell_type": "code", "execution_count": null, "id": "35f015fd", "metadata": { "id": "35f015fd" }, "outputs": [], "source": [ "# Specify the URL to the Maria TC PRIMED folder on NODD\n", "VAL_NODD_URL = 
\"https://noaa-nesdis-tcprimed-pds.s3.amazonaws.com/v01r00/final/2017/AL/15/\"\n", "\n", "# Specify the names of the files we will use from the Maria TC PRIMED folder on NODD\n", "VAL_FILE_NAMES = [\n", " \"TCPRIMED_v01r00-final_AL152017_GMI_GPM_020200_20170918020118.nc\",\n", " \"TCPRIMED_v01r00-final_AL152017_GMI_GPM_020253_20170921122521.nc\",\n", " \"TCPRIMED_v01r00-final_AL152017_GMI_GPM_020284_20170923121423.nc\",\n", " \"TCPRIMED_v01r00-final_AL152017_GMI_GPM_020345_20170927101927.nc\",\n", " ]\n", "\n", "# Declare VAL_DS as a dictionary\n", "VAL_DS = {}\n", "\n", "# Loop through each val file name\n", "for val_file_name in VAL_FILE_NAMES:\n", "\n", " # Join VAL_NODD_URL and val_file_name to produce a complete link\n", " # Retrieve the contents of the TC PRIMED file from the complete link\n", " url_response = requests.get(VAL_NODD_URL + val_file_name)\n", "\n", " # Load the contents of the TC PRIMED file in an \"instance\" type into the\n", " # dictionary VAL_DS with the file name as the dictionary key\n", " VAL_DS[val_file_name] = Dataset(val_file_name, memory=url_response.content)" ] }, { "cell_type": "code", "execution_count": null, "id": "86270520", "metadata": { "id": "86270520" }, "outputs": [], "source": [ "# Specify the URL to the Irma TC PRIMED folder on NODD\n", "TEST_NODD_URL = \"https://noaa-nesdis-tcprimed-pds.s3.amazonaws.com/v01r00/final/2017/AL/11/\"\n", "\n", "# Specify the names of the files we will use from the Irma TC PRIMED folder on NODD\n", "TEST_FILE_NAMES = [\n", " \"TCPRIMED_v01r00-final_AL112017_GMI_GPM_019929_20170831163031.nc\",\n", " \"TCPRIMED_v01r00-final_AL112017_GMI_GPM_019938_20170901054701.nc\",\n", " \"TCPRIMED_v01r00-final_AL112017_GMI_GPM_020007_20170905165205.nc\",\n", " \"TCPRIMED_v01r00-final_AL112017_GMI_GPM_020038_20170907164107.nc\",\n", " ]\n", "\n", "# Declare TEST_DS as a dictionary\n", "TEST_DS = {}\n", "\n", "# Loop through each test file name\n", "for test_file_name in TEST_FILE_NAMES:\n", "\n", " # Join TEST_NODD_URL and test_file_name to produce a complete link\n", " # Retrieve the contents of the TC PRIMED file from the complete link\n", " url_response = requests.get(TEST_NODD_URL + test_file_name)\n", "\n", " # Load the contents of the TC PRIMED file in an \"instance\" type into the\n", " # dictionary VAL_DS with the file name as the dictionary key\n", " TEST_DS[test_file_name] = Dataset(test_file_name, memory=url_response.content)" ] }, { "cell_type": "markdown", "id": "395ba67e", "metadata": { "id": "395ba67e" }, "source": [ "## Data Pre-Processing: Generating Input Features and Target Labels\n", "Recall again that we want to predict the surface precipitation rate at each passive microwave pixel. To select the input features and target label, we have to be mindful of two things:\n", "- Land surfaces have varying microwave brightness temperature signatures due to varying land surface types. The subset of observations we have selected above is not big enough to sufficiently capture all the possible surface types, which would affect our model's ability to make predictions. Therefore, for this notebook, we will focus our model on predicting the surface precipitation rates over the more homogeneous ocean surface.\n", "- The passive microwave brightness temperature fields and GPROF surface precipitation rate field may have masked values. 
When training the model, we must exclude the masked values and only train the model when the selected passive microwave brightness temperature fields *and* the GPROF surface precipitation rate fields contain valid, non-masked values.\n", "\n", "The following points outline the steps we will have to take to pre-process TC PRIMED data to generate the input features and target label:\n", "- load the appropriate passive microwave brightness temperature fields as our input features\n", "- load the GPROF surface precipitation field as our target label\n", "- find the pixels that are over the ocean\n", "- find the pixels over the ocean that have valid, non-masked values for **all** our input features and target label\n", "- combine the input features and target label from all TC PRIMED files into their respective input features and target label arrays\n", "\n", "We incorporate these steps into a function below. Understanding this function is crucial if you wish to properly pre-process TC PRIMED data into pixel-based input features and target labels." ] }, { "cell_type": "code", "execution_count": null, "id": "9aebed67", "metadata": { "id": "9aebed67" }, "outputs": [], "source": [ "def generate_features_and_labels(FILE_DICT, FILE_NAMES, frequencies):\n", " \"\"\"\n", " Generate input features and target labels from a set of TC PRIMED GPM\n", " overpass files\n", "\n", " Parameters\n", " ----------\n", " FILE_DICT : dictionary\n", " dictionary containing the NetCDF instances for each file in FILE_NAMES\n", " FILE_NAMES: array-like\n", " list of TC PRIMED GPM file names from which we want to generate the input\n", " features and target labels\n", " frequencies : array-like\n", " list of GPM observation frequencies we want to use as the input\n", " features\n", "\n", " Returns\n", " -------\n", " features : array-like\n", " array of passive microwave brightness temperatures at each pixel as\n", " our input features\n", " labels : array-like\n", " array of GPROF surface precipitation rate at each pixel as our target\n", " labels\n", " \"\"\"\n", " # Value to replace masked data and to ignore when generating input\n", " # features and target labels\n", " _FillValue = -999.0\n", "\n", " # Since we don't usually know the size of our sample, we will generate an\n", " # empty array and continuously append values\n", "\n", " # Generate a dictionary of empty arrays to store brightness temperature\n", " # values for each observing frequency.\n", " feature_lists = {}\n", " for frequency in frequencies:\n", " feature_lists[frequency] = np.array([])\n", "\n", " # Generate empty array to store GPROF surface precipitation rate\n", " labels = np.array([])\n", "\n", " # Loop through each TC PRIMED GPM file\n", " for FILE_NAME in FILE_NAMES:\n", "\n", " # Load the latitude and longitude values of the observation pixels\n", " latitude = FILE_DICT[FILE_NAME][\"passive_microwave/S1/latitude\"][:]\n", " longitude = FILE_DICT[FILE_NAME][\"passive_microwave/S1/longitude\"][:]\n", "\n", " # The global_land_mask package uses longitude values in the range\n", " # [-180, 180) to determine if a point is over the ocean. 
We will\n", " # have to convert the default longitude values in TC PRIMED with a\n", " # longitude range of [0, 360) to the longitude range of [-180,180)\n", " longitude[longitude >= 180] -= 360\n", "\n", " # Load the surface precipitation rate from GPROF\n", " # Replace the masked value with the specified _FillValue\n", " surfaceprecip = FILE_DICT[FILE_NAME][\"GPROF/S1/surfacePrecipitation\"][:].filled(_FillValue)\n", "\n", " # Using the latitude and longitude values of the pixels, find the\n", " # pixels that are over the ocean. Pixels that are over the ocean will\n", " # have a value of \"True\"\n", " ocean_pixels = globe.is_ocean(latitude, longitude)\n", "\n", " # Find the indices of the pixels that are over the ocean\n", " select_idx = np.argwhere(ocean_pixels == True)\n", "\n", " # Find the indices of the GPROF surface precipitation rate with a valid\n", " # value that is not the _FillValue. Append the indices to the indices of\n", " # pixels over the ocean\n", " select_idx = np.append(select_idx, np.argwhere(surfaceprecip != _FillValue), axis=0)\n", "\n", " # Find all unique indices\n", " # Select the unique indices only if there are more than one in the\n", " # original array (indicating an overlap between the indices for pixels\n", " # over the ocean and GPROF pixels that are not _FillValue)\n", " unique_idx, count_idx = np.unique(select_idx, axis=0, return_counts=True)\n", " select_idx = unique_idx[count_idx > 1]\n", "\n", " # Initialize dictionary to store brightness temperature fields\n", " TB = {}\n", "\n", " # Loop through each observation frequency we want to use as our\n", " # input feature\n", " for frequency in frequencies:\n", "\n", " # Load the brightness temperature field\n", " # Replace the masked value with the specified _FillValue\n", " TB[frequency] = FILE_DICT[FILE_NAME][\"passive_microwave/S1/TB_\" + frequency][:].filled(_FillValue)\n", "\n", " # Append indices of brightness temperature values that are not\n", " # _FillValue to the indices we have already selected above\n", " select_idx = np.append(select_idx, np.argwhere(TB[frequency] != _FillValue), axis=0)\n", "\n", " # Repeat the steps above\n", " # Iteratively select indices for pixels that are over the ocean,\n", " # have GPROF pixels that are not _FillValue, have brightness\n", " # temperature pixels that are not _FillValue, etc.\n", " unique_idx, count_idx = np.unique(select_idx, axis=0, return_counts=True)\n", " select_idx = unique_idx[count_idx > 1]\n", "\n", " # Loop through each index we have selected\n", " for idx in select_idx:\n", "\n", " # Append the value of GPROF surface precipitation rate into our\n", " # target label array\n", " labels = np.append(labels, surfaceprecip[idx[0],idx[1]])\n", "\n", " # Loop through each observation frequency\n", " for frequency in frequencies:\n", "\n", " # Append the value of the brightness temperature into our\n", " # input feature array\n", " feature_lists[frequency] = np.append(feature_lists[frequency], TB[frequency][idx[0],idx[1]])\n", "\n", " # Now, convert dictionary of input feature arrays into a 2-D array\n", " # of input features\n", " for frequency in frequencies:\n", "\n", " if frequency == frequencies[0]:\n", "\n", " features = feature_lists[frequency]\n", "\n", " else:\n", "\n", " features = np.vstack((features,feature_lists[frequency]))\n", "\n", " return np.transpose(features), labels" ] }, { "cell_type": "markdown", "id": "5ab8e8e9", "metadata": { "id": "5ab8e8e9" }, "source": [ "## Data Pre-Processing: Feature Selection\n", "Now that we have a 
function to generate our input features and target labels, let's generate the input features and target labels for our training, validation, and testing sets. During this process, we need to apply our understanding of the physical processes behind our dataset. Our target label is the surface precipitation rate. In order to predict our target label with some skill, we must use the input features that would provide the best information to make the prediction.\n", "\n", "For example, if we want to predict the maximum daily temperature at any time of the year, one of our input features may be the daily minimum solar zenith angle. Low solar zenith angle means that the sun is almost directly overhead. Therefore, the intensity of solar radiation would be stronger and we can expect the maximum daily temperature to be higher. The relationship between the daily minimum solar zenith angle and the maximum daily temperature is based on solid physical understanding of the system we want to predict, and should lead to a machine learning model with some skill. However, if we use a different input feature that is not based on solid physical understanding of the factors that affect the maximum daily temperature — say, the number of times your cat pushes your cup off of your coffee table in a day — then our model will not have any skill.\n", "\n", "NASA briefly describes how the [different GMI observation frequencies](https://gpm.nasa.gov/missions/GPM/GMI) relate to the different types of atmospheric processes. The following list shows the GMI observation frequencies we can use in this notebook:\n", "- 10.65V\n", "- 10.65H\n", "- 18.7V\n", "- 18.7H\n", "- 23.8V\n", "- 36.64V\n", "- 36.64H\n", "- 89.0V\n", "- 89.0H\n", "\n", "To start, let's use the 36.64 GHz and 89.0 GHz observation frequencies as our input features. The 89.0 GHz observation frequency is sensitive to the scattering of Earth radiation by large precipitation-sized ice particles in deep convection. The 36.64 GHz observation frequency is generally sensitive to emission of radiation by liquid precipitation, but can also be affected by scattering processes of large ice particles." ] }, { "cell_type": "code", "execution_count": null, "id": "535da41c", "metadata": { "id": "535da41c" }, "outputs": [], "source": [ "frequencies = [\"36.64V\", \"36.64H\", \"89.0V\", \"89.0H\"]" ] }, { "cell_type": "markdown", "id": "121302f3", "metadata": { "id": "121302f3" }, "source": [ "Generate input features and target labels for our training, validation, and testing sets. 
**These will take a minute or two to run.**" ] }, { "cell_type": "code", "execution_count": null, "id": "f22d160d", "metadata": { "id": "f22d160d" }, "outputs": [], "source": [ "train_features, train_labels = generate_features_and_labels(TRAIN_DS, TRAIN_FILE_NAMES, frequencies)" ] }, { "cell_type": "code", "execution_count": null, "id": "db1575da", "metadata": { "id": "db1575da" }, "outputs": [], "source": [ "val_features, val_labels = generate_features_and_labels(VAL_DS, VAL_FILE_NAMES, frequencies)" ] }, { "cell_type": "code", "execution_count": null, "id": "cdd7c4d0", "metadata": { "id": "cdd7c4d0" }, "outputs": [], "source": [ "test_features, test_labels = generate_features_and_labels(TEST_DS, TEST_FILE_NAMES, frequencies)" ] }, { "cell_type": "markdown", "id": "bf397788", "metadata": { "id": "bf397788" }, "source": [ "## About the Machine Learning Model: The Artificial Neural Network\n", "Here, we provide a basic description of the [artificial neural network](https://www.ibm.com/topics/neural-networks). In *very* simplified terms, the artificial neural network is similar to linear regression but with additional non-linear steps. Let's briefly take a look at the linear regression as an analogy. A linear regression model takes the form of:\n", "$$y = mx + b$$\n", "Where $y$ is your predicted value, $x$ is your input value, $m$ is the weight, and $b$ is the bias. Now, let's say you have an artificial neural network model with two hidden layers. In the first hidden layer, the model takes the input values ($x$) and generates multiple subsets of linear regression models that each produce a predicted value ($y$). The number of linear regression models depends on the number of nodes in the first hidden layer, which is specified by the user. At this point, the model adds non-linearity by applying an activation function to the predicted values from the first hidden layer. This activation function is also specified by the user.\n", "\n", "After the activation function is applied to the predicted values from the first hidden layer, the predicted values become the input values for the next hidden layer, and the process repeats. In the final layer, no activation function is applied and the model linearly regresses the predicted values from the last hidden layer to the target label. During training, the artificial neural network finds the weight ($m$) and bias ($b$) values for each subset of linear regression models by minimizing a loss function (e.g., mean squared error).\n", "\n", "From the basic description of the artificial neural network model above, you have a few basic [model hyperparameters](https://www.youtube.com/watch?v=h291CuASDno) to choose:\n", "- the number of hidden layers\n", "- the number of nodes in each hidden layer\n", "- the activation function\n", "- the loss function\n", "\n", "In addition, there are a few other hyperparameters that we did not discuss, but still need to be selected:\n", "- the optimizer, which is a function that determines how the weights and biases are modified\n", "- the batch size, which tells the model how many samples from the full training set the model should train on before updating the weights and biases\n", "- the number of epochs, which tells the model how many times to train the model using the full training set" ] }, { "cell_type": "markdown", "id": "obi5T1LxOlCH", "metadata": { "id": "obi5T1LxOlCH" }, "source": [ "## But, Wait!\n", "The range of values for our data may vary by the data type. 
For our input features, brightness temperatures in the 89.0 GHz observation frequency may have a larger spread of possible values (for example, from 50 to 310 K) compared to the brightness temperature in the 36.64 GHz observation frequency. The large range of values in the data, and the varying range of values between the data types, can influence the [speed and stability of our model training](https://machinelearningmastery.com/how-to-improve-neural-network-stability-and-modeling-performance-with-data-scaling/), particularly for deep learning neural networks (as opposed to, say, a random forest model). Therefore, there is one additional pre-processing step that we have to take.\n", "\n", "## Data Pre-Processing: Normalizing the Input Data\n", "There are several ways in which we can compress the large range of values in our data and put them onto the same scale. The two most common ways are normalizing or standardizing our data. In this notebook, we will normalize our *input features* before we feed them to our model. Data normalization takes the form of:\n", "$$ x' = \\frac{x - min(x)}{max(x) - min(x)}$$\n", "Here, $x'$ is the normalized value of $x$, and $min(x)$ and $max(x)$ are the respective minimum value and maximum value of $x$. By normalizing the data, we constrain the values of each of our input features to be between zero and one. Typically, the values of $min(x)$ and $max(x)$ come from domain knowledge. But, in this notebook, we will simply obtain them from the training, validation, and testing sets. Let's generate a function to obtain the minimum and maximum values for *each* of our input features." ] }, { "cell_type": "code", "execution_count": null, "id": "FG1rFP_sdLiR", "metadata": { "id": "FG1rFP_sdLiR" }, "outputs": [], "source": [ "def obtain_feature_min_max(frequencies, train_features, val_features, test_features):\n", " \"\"\"\n", " Obtain min and max values of input features from the training, validation,\n", " and testing sets\n", "\n", " Parameters\n", " ----------\n", " frequencies : array-like\n", " list of GPM observation frequencies we want to use as the input\n", " features\n", " train_features : array-like\n", " 2-D array of input features from the training set\n", " val_features : array-like\n", " 2-D array of input features from the validation set\n", " test_features : array-like\n", " 2-D array of input features from the testing set\n", "\n", " Returns\n", " -------\n", " feature_min : dict\n", " dictionary containing the minimum value of each observation frequency\n", " in the input features\n", " feature_max : dict\n", " dictionary containing the maximum value of each observation frequency\n", " in the input features\n", " \"\"\"\n", "\n", " feature_min = {}\n", " feature_max = {}\n", "\n", " for freq_idx, frequency in enumerate(frequencies):\n", "\n", " all_features = np.hstack((train_features[:, freq_idx],\n", " val_features[:, freq_idx],\n", " test_features[:, freq_idx]))\n", "\n", " feature_min[frequency] = np.min(all_features)\n", " feature_max[frequency] = np.max(all_features)\n", "\n", " return feature_min, feature_max" ] }, { "cell_type": "markdown", "id": "-c1cfNlekB6h", "metadata": { "id": "-c1cfNlekB6h" }, "source": [ "Let's obtain the minimum and maximum values of each of our input features using the function we have generated. Then, let's just look at the range of values of the input features." 
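, "\n", "As a quick sanity check of the normalization formula above, consider a made-up (hypothetical) brightness temperature of 200 K for a channel whose minimum is 100 K and maximum is 300 K:\n", "```python\n", "# hypothetical numbers, just to verify the min-max formula by hand\n", "tb, tb_min, tb_max = 200.0, 100.0, 300.0\n", "tb_norm = (tb - tb_min) / (tb_max - tb_min)\n", "print(tb_norm)  # 0.5\n", "```"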
] }, { "cell_type": "code", "execution_count": null, "id": "FUKOrolEkL_T", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 141, "status": "ok", "timestamp": 1686607951161, "user": { "displayName": "Naufal Razin", "userId": "00945304052336708650" }, "user_tz": 360 }, "id": "FUKOrolEkL_T", "outputId": "7e2aa9c8-44f3-432d-af99-b13d2dd7eff5" }, "outputs": [], "source": [ "feature_min, feature_max = obtain_feature_min_max(frequencies, train_features, val_features, test_features)\n", "\n", "for frequency in frequencies:\n", "\n", " print(\"Observation frequency: \", frequency)\n", " print(\"Minimum value: \", feature_min[frequency])\n", " print(\"Maximum value: \", feature_max[frequency])\n", " print(\" \")" ] }, { "cell_type": "markdown", "id": "fjy8R_sokYR3", "metadata": { "id": "fjy8R_sokYR3" }, "source": [ "Now that we have the minimum and maximum value for each input feature, let's generate a function to normalize the input feature." ] }, { "cell_type": "code", "execution_count": null, "id": "huBM39Q_kXwH", "metadata": { "id": "huBM39Q_kXwH" }, "outputs": [], "source": [ "def normalize_input_features(frequencies, input_features, feature_min, feature_max):\n", " \"\"\"\n", " Normalize input features\n", "\n", " Parameters\n", " ----------\n", " frequencies : array-like\n", " list of GPM observation frequencies we want to use as the input\n", " features\n", " input_features : array-like\n", " 2-D array of input features\n", " feature_min : dict\n", " dictionary containing minimum value for each observing frequency\n", " feature_max : dict\n", " dictionary containing maximum value for each observing frequency\n", "\n", " Returns\n", " -------\n", " norm_input_features : array-like\n", " array of input features that have been normalized\n", " \"\"\"\n", "\n", " norm_input_features = np.zeros(input_features.shape)\n", "\n", " for freq_idx, frequency in enumerate(frequencies):\n", "\n", " norm_input_features[:,freq_idx] = (input_features[:,freq_idx] - feature_min[frequency]) / (feature_max[frequency] - feature_min[frequency])\n", "\n", " return norm_input_features" ] }, { "cell_type": "markdown", "id": "a-JzGls8qCAT", "metadata": { "id": "a-JzGls8qCAT" }, "source": [ "Now, let's obtain normalized input features for our training, validation, and testing sets." 
] }, { "cell_type": "code", "execution_count": null, "id": "GqeZMwBHqKpy", "metadata": { "id": "GqeZMwBHqKpy" }, "outputs": [], "source": [ "norm_train_features = normalize_input_features(frequencies, train_features, feature_min, feature_max)" ] }, { "cell_type": "code", "execution_count": null, "id": "CB-mqJRUqeDd", "metadata": { "id": "CB-mqJRUqeDd" }, "outputs": [], "source": [ "norm_val_features = normalize_input_features(frequencies, val_features, feature_min, feature_max)" ] }, { "cell_type": "code", "execution_count": null, "id": "7x65ylqlqhgI", "metadata": { "id": "7x65ylqlqhgI" }, "outputs": [], "source": [ "norm_test_features = normalize_input_features(frequencies, test_features, feature_min, feature_max)" ] }, { "cell_type": "markdown", "id": "rNSg40jVrLHN", "metadata": { "id": "rNSg40jVrLHN" }, "source": [ "Now that we have normalized our input features, we are ready to set the model parameters and run it!\n", "\n", "## Setting Up the Model" ] }, { "cell_type": "code", "execution_count": null, "id": "66f193a9", "metadata": { "id": "66f193a9" }, "outputs": [], "source": [ "# Specify number of nodes in a hidden layer\n", "n_units = 128\n", "\n", "# Specify the activation function to use\n", "activation_func = \"relu\"\n", "\n", "# Initialize the model\n", "model = tf.keras.Sequential()\n", "\n", "# Add the first hidden layer with the specified number of nodes and activation\n", "# function\n", "# Also specify the number of input nodes, which is equivalent to the\n", "# number of input features that we have\n", "model.add(layers.Dense(n_units, activation=activation_func, input_shape=(len(frequencies),)))\n", "\n", "# Add a second hidden layer with the specified number of nodes and activation\n", "# function\n", "# No need to specify the number of input nodes, as the model knows this\n", "# from the previous hidden layer\n", "model.add(layers.Dense(n_units, activation=activation_func))\n", "\n", "# Add a final layer with one node for the output value that we want\n", "# to predict (our target label, the GPROF surface precipitation rate)\n", "# The activation function here must be linear\n", "model.add(layers.Dense(1,activation=\"linear\"))\n", "\n", "# Compile the model\n", "# Let's go with the Adam optimizer with a learning rate of 0.01\n", "# and a mean squared error loss function\n", "model.compile(optimizer=keras.optimizers.Adam(0.01), loss=\"mse\")" ] }, { "cell_type": "markdown", "id": "dFARUlFKr1fS", "metadata": { "id": "dFARUlFKr1fS" }, "source": [ "## Training the Model" ] }, { "cell_type": "code", "execution_count": null, "id": "53917f5c", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 83124, "status": "ok", "timestamp": 1686608045309, "user": { "displayName": "Naufal Razin", "userId": "00945304052336708650" }, "user_tz": 360 }, "id": "53917f5c", "outputId": "a8ca351a-5faa-48aa-9e99-c1ba99483925" }, "outputs": [], "source": [ "# Specify number of times (epochs) to train the model\n", "# Let's go with 50\n", "n_epochs = 50\n", "\n", "# Specify the size of the samples the model should train on before\n", "# updating the weights and biases in one epoch\n", "# Let's go with 5000\n", "batch_size = 5000\n", "\n", "# Train the model as an object called \"trained_model\"\n", "# Validate the model performance using the validation set\n", "trained_model = model.fit(norm_train_features, train_labels,\n", " validation_data=(norm_val_features,val_labels),\n", " epochs=n_epochs, batch_size=batch_size)" ] }, { "cell_type": "markdown", 
"id": "42d7cc2f", "metadata": { "id": "42d7cc2f" }, "source": [ "Note that if you re-run the cell directly above, the model will continue to be trained. In other words, training the model with 50 epochs twice means the model will be trained with 100 epochs.\n", "\n", "## Analyzing Model Convergence\n", "From running the cell above, you will notice that the output prints out the training and validation loss values at each epoch. Typically, a good model should *eventually* have similar training and validation loss values. In other words, after some number of epochs, a good model should have training and validation loss values that converge. This convergence generally means that the model is generalized enough that it performs just as well on the validation data (data it hasn't seen) as it does on the training data (data it has seen). We can plot the evolution of the training and validation loss values using the `trained_model` object. Let's do that!" ] }, { "cell_type": "code", "execution_count": null, "id": "TgvLFU_paRBx", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 11, "status": "ok", "timestamp": 1686608045310, "user": { "displayName": "Naufal Razin", "userId": "00945304052336708650" }, "user_tz": 360 }, "id": "TgvLFU_paRBx", "outputId": "f3c97f00-4c8a-47a8-cf82-acd9328266ad" }, "outputs": [], "source": [ "# Print out the name of the data\n", "# stored in trained_model\n", "trained_model.history.keys()" ] }, { "cell_type": "code", "execution_count": null, "id": "SZQHliTMcO9d", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 466 }, "executionInfo": { "elapsed": 565, "status": "ok", "timestamp": 1686608045868, "user": { "displayName": "Naufal Razin", "userId": "00945304052336708650" }, "user_tz": 360 }, "id": "SZQHliTMcO9d", "outputId": "b3f7c312-4b44-4cfd-b505-44663cd74bf7" }, "outputs": [], "source": [ "# Plot the training and validation loss values\n", "plt.plot(trained_model.history['loss'], label='Training')\n", "plt.plot(trained_model.history['val_loss'], label='Validation')\n", "plt.legend()\n", "plt.xlabel('Epoch')\n", "plt.ylabel('MSE')" ] }, { "cell_type": "markdown", "id": "6-FMyW_Wc8Fu", "metadata": { "id": "6-FMyW_Wc8Fu" }, "source": [ "From the plot of loss values above, you can see that the loss values do not converge. Rather, the loss values fluctuate wildly, especially for the validation data. This pattern of loss values indicate that our model is overfitting on the training data — it \"memorizes\" the training data instead of learning a general pattern that also applies to the validation data. At this stage, you would want to go back and change the model hyperparameters to constrain the model and prevent it from overfitting. But, since this is just an example exercise, we will proceed and discuss various aspects of the model setup along the way." ] }, { "cell_type": "markdown", "id": "klgrnhOZv4I9", "metadata": { "id": "klgrnhOZv4I9" }, "source": [ "## Analyzing Model Prediction\n", "Now, what does the relationship between the predicted surface precipitation rate values and the actual surface precipitation rate values look like? Let's make a prediction on the training and validation data using the model. Then, let's plot a scatter plot of the values." 
] }, { "cell_type": "code", "execution_count": null, "id": "12ca51f6", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 20315, "status": "ok", "timestamp": 1686608066182, "user": { "displayName": "Naufal Razin", "userId": "00945304052336708650" }, "user_tz": 360 }, "id": "12ca51f6", "outputId": "99032db2-b691-438c-b159-8ad40e7943c0" }, "outputs": [], "source": [ "# Predict the surface precipitation rate from the training data\n", "train_predict = model.predict(norm_train_features)\n", "\n", "# Predict the surface precipitation rate from the validation data\n", "val_predict = model.predict(norm_val_features)" ] }, { "cell_type": "code", "execution_count": null, "id": "0fc1a4cc", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 504 }, "executionInfo": { "elapsed": 1098, "status": "ok", "timestamp": 1686608067267, "user": { "displayName": "Naufal Razin", "userId": "00945304052336708650" }, "user_tz": 360 }, "id": "0fc1a4cc", "outputId": "1a247d63-0857-4037-ca52-23defce47397" }, "outputs": [], "source": [ "# Plot a scatter plot for the training and validation predictions\n", "plt.figure(figsize=[12,6])\n", "\n", "# For the training set\n", "plt.subplot(1,2,1)\n", "plt.scatter(train_predict,train_labels, marker=\"+\", color=\"k\")\n", "\n", "# Plot a one-to-one line\n", "plt.plot([0,100],[0,100], color=\"red\", lw=2)\n", "\n", "plt.xlabel(\"Neural Network\\n Surface Precipitation Rate (mm/hr)\")\n", "plt.ylabel(\"GPROF Surface Precipitation Rate (mm/hr)\")\n", "\n", "plt.title(\"Training Set\")\n", "\n", "# For the validation set\n", "plt.subplot(1, 2, 2)\n", "plt.scatter(val_predict,val_labels, marker=\"+\", color=\"k\")\n", "\n", "# Plot a one-to-one line\n", "plt.plot([0, 100],[0, 100], color=\"red\", lw=2)\n", "\n", "plt.xlabel(\"Neural Network\\n Surface Precipitation Rate (mm/hr)\")\n", "\n", "plt.title(\"Validation Set\")" ] }, { "cell_type": "markdown", "id": "d3a8ec8a", "metadata": { "id": "d3a8ec8a" }, "source": [ "From the scatter plots, you can see that the neural network model under-predicts surface precipitation rates at the higher end of the surface precipitation rate distribution. In other words, when the actual surface precipitation rates are above 40 mm/hour, the model predicts surface precipitation of only around 20 mm/hour. There are a multitude of factors that may explain the model's skill here. Some of them include:\n", "- the training set that we have used here is not sufficiently big. Recall that we used observations from only nine overpasses of one storm. The sample size is insufficient to properly represent all possible distributions of surface precipitation rates.\n", "- we used only two observation frequencies on which to train the model, the 36.64 and 89.0 GHz. While the 36.64 and 89.0 GHz observation frequencies are important in predicting the surface precipitation rates, other observation frequencies may provide additional independent observations that can help improve the model skill.\n", "- we have not tuned the model hyperparameters to try to get the best model skill.\n", "\n", "Nonetheless, the examples thus far should provide you with a good baseline on which to train a machine learning model for pixel-based applications with TC PRIMED overpass files." ] }, { "cell_type": "markdown", "id": "249819ed", "metadata": { "id": "249819ed" }, "source": [ "### A Thought Experiment\n", "Look at the GPROF surface precipitation field in Figure 1. 
If you were to guess the *most common* surface precipitation rate value, what would it be? Since precipitation regions occupy only a small portion of the image, the most common surface precipitation rate value would be 0 mm/hour.\n", "\n", "Now, think about our chosen loss function — the mean squared error. If the model is trained on minimizing the mean squared error, how does the *most common* surface precipitation rate value affect the *mean* model error during training? Since most surface precipitation rate values are at or near 0 mm/hour, the mean error during training is dominated by these near-zero pixels, so the model is rewarded for predictions that stay low and does not do well in predicting the rare, very high surface precipitation rate events (e.g., above 30 mm/hour). The choice of the loss function explains, at least partly, the reason behind the model's under-prediction.\n", "\n", "This simple thought experiment highlights the importance of selecting the appropriate model loss function that best suits your purpose." ] }, { "cell_type": "markdown", "id": "jkdKZhKuWCs6", "metadata": { "id": "jkdKZhKuWCs6" }, "source": [ "
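If you want to experiment with the loss function before moving on, one possible tweak (a sketch, not a tuned recommendation) is to recompile the model with a different loss, such as the mean absolute error, and then re-run the training cell to see how the predictions change:\n", "```python\n", "# sketch: swap in a different loss function, then retrain and compare\n", "model.compile(optimizer=keras.optimizers.Adam(0.01), loss=\"mae\")\n", "```\n", "\n", "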
\n", "

Exercise 1

Continue to add one observation frequency at a time in the model training. How did adding more observation frequency impact the model skill?\n", "
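\n", "\n", "One hypothetical starting point: add the 23.8 GHz channel to the frequency list, then re-run the feature generation, normalization, and training cells and compare the loss curves and scatter plots.\n", "\n", "```python\n", "# hypothetical example: one extra channel added to the input features\n", "frequencies = [\"23.8V\", \"36.64V\", \"36.64H\", \"89.0V\", \"89.0H\"]\n", "```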
" ] }, { "cell_type": "markdown", "id": "db69d48f", "metadata": {}, "source": [ "
\n", "

Exercise 2

Up to this point, we have not played around with the testing dataset. Rightly so. The testing dataset should not be touched until you have a final model! However, if you're curious about the model prediction on the testing dataset, repeat the steps above. Do the figures confirm what we understand about the model's deficiencies?\n", "
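\n", "\n", "A minimal sketch of that check (it assumes the pixel-based `norm_test_features` and `test_labels` from earlier; note that `norm_test_features` is redefined in the next section):\n", "\n", "```python\n", "# predict on the held-out Irma pixels and compare against the GPROF labels\n", "test_predict = model.predict(norm_test_features)\n", "test_mae = np.mean(np.abs(test_predict.flatten() - test_labels))\n", "print(test_mae)\n", "```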
" ] }, { "cell_type": "markdown", "id": "e4b0114a", "metadata": { "id": "e4b0114a" }, "source": [ "## Applying the Artificial Neural Network Model to a TC PRIMED Overpass\n", "Let's say that you've trained the model, you're happy with the validation scores, and — as the very last step — confirmed your validation scores with the testing set. Congratulations! You have a working model! But now, how do you apply the model to a TC PRIMED overpass? Recall that:\n", "- our model only takes input features in the form of a two-dimensional array.\n", "- we trained the model to only make predictions over the ocean.\n", "- while we have taken care to exclude masked values in training our model, we cannot simply ignore missing values if we want to keep our data in its original shape and format.\n", "\n", "We would have to perform the following steps:\n", "- find the indices of all pixels over land\n", "- replace the masked value in the input features with a \\_FillValue\n", "- find the indices of all \\_FillValue in the input features\n", "- format the input features into a two-dimensional array\n", "- normalize the input features using the min and max values we have obtained above\n", "- run the model on the input features to make predictions, including on the \\_FillValue\n", "- re-format the model predictions to be consistent with the input format\n", "- use the indices of all pixels over land and \\_FillValue in the input features to replace the model predictions with a \\_FillValue\n", "\n", "Let's perform the steps on one overpass from our testing set above, which is an observation of Hurricane Irma (2017)." ] }, { "cell_type": "code", "execution_count": null, "id": "5aa8fc94", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 5595, "status": "ok", "timestamp": 1686608090260, "user": { "displayName": "Naufal Razin", "userId": "00945304052336708650" }, "user_tz": 360 }, "id": "5aa8fc94", "outputId": "66c8eef1-d13f-4630-b108-835013950254" }, "outputs": [], "source": [ "# Retrieve a particular netCDF object from the test sample\n", "DS = TEST_DS[TEST_FILE_NAMES[2]]\n", "\n", "# Load the latitude and longitude values of the observation pixels\n", "latitude = DS[\"passive_microwave/S1/latitude\"][:]\n", "longitude = DS[\"passive_microwave/S1/longitude\"][:]\n", "\n", "# The global_land_mask package uses longitude values in the range\n", "# [-180, 180) to determine if a point is over the ocean. We will\n", "# have to convert the default longitude values in TC PRIMED with a\n", "# longitude range of [0, 360) to the longitude range of [-180,180)\n", "longitude[longitude >= 180] -= 360\n", "\n", "# Get the shape of the input\n", "input_shape = latitude.shape\n", "\n", "# Using the latitude and longitude values of the pixels, find the\n", "# pixels that are over the land. 
Pixels that are over the land will\n", "# have a value of \"True\"\n", "land_pixels = globe.is_land(latitude, longitude)\n", "\n", "# Find the indices of the pixels that are over land\n", "select_idx = np.argwhere(land_pixels == True)\n", "\n", "# Initialize array for input features\n", "test_features = []\n", "\n", "# Loop through each observation frequency\n", "for frequency in frequencies:\n", "\n", " # Load the brightness temperature field\n", " # Replace the masked value with -999.0\n", " TB = DS[\"passive_microwave/S1/TB_\" + frequency][:].filled(-999.0)\n", "\n", " # Append indices of brightness temperature values that are\n", " # _FillValue to the indices we have already selected above\n", " select_idx = np.append(select_idx, np.argwhere(TB == -999.0), axis=0)\n", "\n", " # Get only unique indices\n", " select_idx = np.unique(select_idx, axis=0)\n", "\n", " # Put the input features into an array\n", " if frequency == frequencies[0]:\n", "\n", " test_features = TB.flatten()\n", "\n", " else:\n", "\n", " test_features = np.vstack((test_features, TB.flatten()))\n", "\n", "# Transpose the input feature array for model ingest\n", "test_features = np.transpose(test_features)\n", "\n", "# Normalize test features\n", "norm_test_features = normalize_input_features(frequencies, test_features, feature_min, feature_max)\n", "\n", "# Load the x, y, and GPROF surface precipitation rates for plotting and\n", "# to compare with the model predictions\n", "# Use NaN value to replace masked value in GPROF surface precipitation\n", "# rate variable, for plotting\n", "x = DS[\"passive_microwave/S1/x\"][:]\n", "y = DS[\"passive_microwave/S1/y\"][:]\n", "GPROF_surfaceprecip = DS[\"GPROF/S1/surfacePrecipitation\"][:].filled(np.nan)\n", "\n", "# Make a prediction with the model\n", "model_surfaceprecip = model.predict(norm_test_features)\n", "\n", "# Reformat the output to match the input\n", "model_surfaceprecip = model_surfaceprecip.reshape(input_shape)\n", "\n", "# Loop through each index with land pixels or masked value in any of\n", "# the input features\n", "# Replace the model prediction with NaN for plotting\n", "for idx in select_idx:\n", "\n", " model_surfaceprecip[idx[0], idx[1]] = np.nan" ] }, { "cell_type": "code", "execution_count": null, "id": "53a1de2c", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 621 }, "executionInfo": { "elapsed": 1171, "status": "ok", "timestamp": 1686608094703, "user": { "displayName": "Naufal Razin", "userId": "00945304052336708650" }, "user_tz": 360 }, "id": "53a1de2c", "outputId": "66c2be00-80a4-481a-8e55-5d317804fbe3" }, "outputs": [], "source": [ "# Set figure size\n", "plt.figure(figsize=[12,8])\n", "\n", "# For the model surface precipitation prediction\n", "plt.subplot(1,2,1)\n", "plt.pcolormesh(x, y, model_surfaceprecip, vmin=0, vmax=50, cmap=\"viridis_r\")\n", "plt.colorbar(orientation=\"horizontal\",pad=0.13, label=\"ANN Surface Precipitation Rates (mm/hr)\")\n", "\n", "plt.xlim(-500,500)\n", "plt.xlabel(\"Zonal Distance from Tropical Cyclone Center (km)\")\n", "plt.ylim(-500,500)\n", "plt.ylabel(\"Meridional Distance from Tropical Cyclone Center (km)\")\n", "\n", "# For the GPROF surface precipitation field\n", "plt.subplot(1,2,2)\n", "plt.pcolormesh(x, y, GPROF_surfaceprecip, vmin=0, vmax=50, cmap=\"viridis_r\")\n", "plt.colorbar(orientation=\"horizontal\",pad=0.13, label=\"GPROF Surface Precipitation Rates (mm/hr)\")\n", "\n", "plt.xlim(-500,500)\n", "plt.xlabel(\"Zonal Distance from Tropical Cyclone Center (km)\")\n", 
"plt.ylim(-500,500)" ] }, { "cell_type": "markdown", "id": "754ab78f", "metadata": { "id": "754ab78f" }, "source": [ "As we can see from above, our artifical neural network model does really well at capturing the general structure of Hurricane Irma's precipitation field. However, as we have discussed above, the model could not capture the magnitude of the surface precipitation rates, when compared to GPROF. In addition, our model field contains a few areas with missing values. These areas correspond to pixels located over the islands of the Lesser Antilles. Since the pixels are over land and we trained our model to predict the surface precipitation rates over the ocean, we have replaced the model's prediction here with NaN. Finally, note that the GPROF swath is slightly narrower than the microwave swath, due to the algorithm design which we will not discuss." ] }, { "cell_type": "markdown", "id": "K96FAtHubeLI", "metadata": { "id": "K96FAtHubeLI" }, "source": [ "## Close the File\n", "When loading data from a NetCDF file, **always remember to close the file**. A best practice would be to close the file immediately after loading the variable or attribute of interest. However, since we're loading various variables throughout this notebook, we will close the files at the end of this notebook using the command below." ] }, { "cell_type": "code", "execution_count": null, "id": "2qnk0eVBbt4F", "metadata": { "id": "2qnk0eVBbt4F" }, "outputs": [], "source": [ "for train_file_name in TRAIN_FILE_NAMES:\n", " TRAIN_DS[train_file_name].close()\n", "\n", "for val_file_name in VAL_FILE_NAMES:\n", " VAL_DS[val_file_name].close()\n", "\n", "for test_file_name in TEST_FILE_NAMES:\n", " TEST_DS[test_file_name].close()" ] }, { "cell_type": "markdown", "id": "a425d438", "metadata": { "id": "a425d438" }, "source": [ "## Final Thoughts\n", "The example shown here demonstrates one way in which you can use TC PRIMED for machine learning applications — a pixel-based approach. More complex approaches, as we have noted above, require more data pre-processing such as re-gridding the observations onto a common, equidistant grid. We explore one method for re-gridding TC PRIMED in another notebook.\n", "\n", "In this tutorial, you learned to pre-process TC PRIMED overpass data for a pixel-based machine learning application. Subsequently, you learned how to set up your artificial neural network model and train it on your dataset. Then, you learned how to apply the model to a TC PRIMED overpass.\n", "\n", "The simple approaches adopted in this tutorial is meant to demonstrate a machine learning application. For greater success, you should include more data in training your model and experiment with the model hyperparameters that would get you the best model skill while not underfitting or overfitting on your data." ] }, { "cell_type": "markdown", "id": "77ce50a9", "metadata": { "id": "77ce50a9" }, "source": [ "## Data Statement\n", "- Razin, Muhammad Naufal; Slocum, Christopher J.; Knaff, John A.; Brown, Paula J. 2023. Tropical Cyclone PRecipitation, Infrared, Microwave, and Environmental Dataset (TC PRIMED). v01r00. NOAA National Centers for Environmental Information. https://doi.org/10.25921/dmy1-0595." ] }, { "cell_type": "markdown", "id": "afbad221", "metadata": { "id": "afbad221" }, "source": [ "## References\n", "- Berg, W., and Coauthors, 2016: Intercalibration of the GPM microwave radiometer constellation. J. Atmos. Oceanic Technol., 33, 2639–2654, https://doi.org/10.1175/JTECH-D-16-0100.1\n", "- Kummerow, C. 
D., D. L. Randel, M. Kulie, N.-Y. Wang, R. Ferraro, S. J. Munchak, and V. Petkovic, 2015: The evolution of the Goddard profiling algorithm to a fully parametric scheme. J. Atmos. Oceanic Technol., 32, 165–176, https://doi.org/10.1175/JTECH-D-15-0039.1" ] }, { "cell_type": "markdown", "id": "cb3e4863", "metadata": { "id": "cb3e4863" }, "source": [ "## Metadata\n", "- Language / package(s)\n", " - Python\n", " - NetCDF\n", " - Numpy\n", " - TensorFlow\n", " - Matplotlib\n", "- Domain\n", " - NOAA\n", " - NASA\n", "- Application keywords\n", " - Satellite passive microwave\n", " - NASA Goddard PROFiling Algorithm\n", "- Geophysical keywords\n", " - Tropical cyclones\n", " - Surface precipitation rate\n", "- AI keywords\n", " - Data pre-processing\n", " - Artificial Neural Network" ] }, { "cell_type": "code", "execution_count": null, "id": "afacdc60", "metadata": { "id": "afacdc60" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 5 }