How Machine Learning Helps Us To Predict Wildfires

A step-by-step guide to predictive modeling in machine learning

Yeyu Huang
Level Up Coding



In recent years, the global climate has seen more frequent extreme weather events, and intensifying global warming has raised the likelihood of climate-related disasters such as heat waves, droughts, and wildfires. Globally, record-breaking temperatures in Spain and other countries in 2022 triggered wildfires.

I have written an article on the wildfire topic that demonstrates how to create an interactive dashboard with rich content on yearly wildfires in the United States:

Wildfires are unpredictable, unplanned, and uncontrolled. They may be caused by campfires, smoking, arson, fireworks, gunfire, explosions, or lightning, and are more likely to occur in extremely hot weather.

So in this article, let's go one step further and predict wildfires using machine learning techniques.

1. Dataset Preparation

The dataset used for this demonstration comes from NASA FIRMS (Fire Information for Resource Management System), which provides yearly summaries of active fire records in each country, detected by satellites and measured by MODIS (Moderate Resolution Imaging Spectroradiometer). The MODIS active fire data product detects fires in 1 km pixels that are burning at the time of overpass under relatively cloud-free conditions, using a contextual algorithm.

We can download the zip files by year, from 2000 to 2021, free of charge from their website:

Each zip file contains CSV files for almost every country in the world. Today we are going to use the data for Australia in 2021. Here is the download link to the file "modis_2021_Australia.csv".

2. Data Exploration

First, let's import NumPy and Pandas, as well as the warnings module, and have a look at the wildfire records in modis_2021_Australia.csv.

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
df = pd.read_csv("modis_2021_Australia.csv")

df.head()

The feature definitions are straightforward to look up on the MODIS website, but let's elucidate some of them here:

  • scan = the resolution of the scan
  • track = the resolution of the track
  • confidence = the confidence of the fire detection, from 0% to 100%
  • bright_t31 = brightness temperature measured in channel 31, in K (Kelvin)
  • frp = fire radiative power, in MW (megawatts)
  • type = 0 presumed vegetation fire, 1 active volcano, 2 other static land source, 3 offshore

In this demonstration, we regard frp as the target variable for prediction, because fire radiative power reflects the severity of a wildfire.

Let’s go a little deeper.

a) Data dimension
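For example, Pandas reports the dimensions as a (rows, columns) tuple:

df.shape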

b) Data info
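The info() method shows the column types and non-null counts:

df.info()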

c) Features
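The feature names are available as a one-liner:

df.columns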

d) Check if missing data
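A quick way to count missing values per column:

df.isnull().sum()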

e) Basic statistics of the features
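And describe() summarizes the numerical features (count, mean, std, min, quartiles, max):

df.describe()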

3. Data Analysis

Continuous Value Field Distribution Analysis

We analyze and visualize the data, mainly using the Matplotlib and Seaborn libraries.

import matplotlib.pyplot as plt
import seaborn as sns

We first analyze the distribution of the numerical data. The most direct method here is to run Seaborn's pairplot on the entire dataset, which plots the distribution of each numerical feature in the diagonal boxes and the joint distribution of each pair of features in the other boxes.

sns.pairplot(df) 
plt.show()

Category Field Distribution Analysis

Our dataset also has some categorical variables, such as satellite and daynight, and we draw boxplots of confidence against them to see the data distribution characteristics.

plt.figure(figsize=(20, 12))
plt.subplot(2,2,1)
sns.boxplot(x = 'satellite', y = 'confidence', data = df)
plt.subplot(2,2,2)
sns.boxplot(x = 'daynight', y = 'confidence', data = df)
plt.show()

Correlation Analysis

We can compute the correlation matrix with Pandas and visualize it as a heat map with Seaborn, which is especially helpful for understanding the correlation between the target variable frp and the other features, as follows:
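A minimal sketch of this step:

corr = df.corr(numeric_only=True)  # numeric_only needs Pandas >= 1.5; older versions skip non-numeric columns automatically
plt.figure(figsize=(10, 10))
sns.heatmap(corr, annot=True, cmap='viridis', linewidths=.5)
plt.show()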

We can also check the specific correlation between confidence and frp.
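A one-line sketch of that check (Pearson correlation by default):

df['confidence'].corr(df['frp'])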

Or sort by frp:

df_topaffected = df.sort_values(by='frp', ascending=False)
df_topaffected.head(10)

4. Data Cleaning

The real world contains a variety of data types, including unstructured data such as text, video, and images, as well as structured data with various missing and erroneous values. Before actually feeding the data to the model for training, we do some cleaning to improve its quality. The specific operations include filling in missing values, removing irrelevant data, scaling and normalizing the data, etc.

a) Cleaning

First, the track, instrument, and version features are not useful for our training, so we simply drop them:

df = df.drop(['track'], axis = 1)
df = df.drop(['instrument', 'version'], axis = 1)

Then, the satellite and daynight features are not numerical, so we transform them into numerical data:

df['satellite'] = df['satellite'].map({'Terra':0,'Aqua':1})
df['daynight'] = df['daynight'].map({'D':0,'N':1})

We also want to extract the month from acq_date and use it as a feature instead:

df['month'] = df['acq_date'].apply(lambda x:int(x.split('-')[1]))
df = df.drop(['acq_date'], axis = 1)

Finally, we downsample the data to 20% of the records to speed up training, reset the index, and check the result after cleaning:

df = df.sample(frac=0.2)
df = df.reset_index().drop("index", axis=1)
df.head()

b) Cleaned heat map

We remove the target variable frp, along with confidence, from the feature set, and draw the correlation heat map of the remaining features:

fire_df = df.drop(['confidence', 'frp'], axis = 1)
plt.figure(figsize=(10, 10))
sns.heatmap(fire_df.corr(),annot=True,cmap='viridis',linewidths=.5)
plt.show()

5. Data Split

For efficient modeling and evaluation, we split the dataset into a training set (70% of the data) and a test set (30%). We use fire_df as the feature set X and frp as the target y in the training and testing process.

X = fire_df
y = df['frp']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

6. Evaluation Metrics

We use frp as the target label. There are different modeling methods; since this variable is continuous, the task is a regression modeling challenge. The following evaluation metrics can be used.

  • MSE — Mean Squared Error: the mean of the squared differences between the actual and predicted values of the dataset
  • MAE — Mean Absolute Error: the mean of the absolute differences between the actual and predicted values of the dataset
  • RMSE — Root Mean Square Error: the square root of the mean squared difference between the actual and predicted values, i.e. the square root of the MSE
  • R-squared score — the proportion of variance in the target explained by the model, calculated as R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)², where ŷᵢ are the predicted values and ȳ is the mean of the actual values (see the sketch after this list)
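As a quick illustration, here is a minimal NumPy sketch of these metrics on toy values (in the modeling sections below we use Scikit-Learn's ready-made implementations):

# toy actual and predicted values, purely for illustration
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

mse = np.mean((y_true - y_hat) ** 2)   # Mean Squared Error
mae = np.mean(np.abs(y_true - y_hat))  # Mean Absolute Error
rmse = np.sqrt(mse)                    # Root Mean Square Error
r2 = 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)
print(mse, mae, rmse, r2)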

7. Modeling and Estimation

a) Gradient Boosting Regression (GBR)

We use GBR for fitting through the GradientBoostingRegressor class in Scikit-Learn. By leveraging this low-code package, we don't need to investigate its principles and implementation deeply for this application. Only a few steps are needed for GBR modeling and prediction:

Import sklearn modules:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score

Model creation, training, and prediction (y_pred):

model1 = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                   max_depth=10, random_state=0,
                                   loss='squared_error')  # named 'ls' in scikit-learn < 1.0
model1.fit(X_train, y_train)
y_pred = model1.predict(X_test)

Evaluating the performance of y_pred against the test set:

print('MSE =', mse(y_test, y_pred))
print('RMSE =', np.sqrt(mse(y_test, y_pred)))
print('MAE =', mae(y_test, y_pred))
print('R2_score =', r2_score(y_test, y_pred))  # r2_score expects (y_true, y_pred)
print("GBR Accuracy, {:.5f}%".format(model1.score(X_test, y_test)*100))

The output shows satisfying results:

MSE = 626.0060940209611
RMSE = 25.02011378912896
MAE = 4.692727951792414
R2_score = 0.9215091371543757
GBR Accuracy, 93.67964%

b) Decision Tree Regression

We can also use the decision tree regression method for modeling; in Scikit-Learn it is provided by the DecisionTreeRegressor class.

It also requires only a few lines of code, as in the section above. Below is the code for implementation and evaluation:

from sklearn.tree import DecisionTreeRegressor as dtr

reg = dtr(random_state=42)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print('MSE =', mse(y_test, y_pred))
print('RMSE =', np.sqrt(mse(y_test, y_pred)))
print('MAE =', mae(y_test, y_pred))
print('R2 score =', r2_score(y_test, y_pred))  # r2_score expects (y_true, y_pred)
print("Decision Tree Regressor Accuracy, {:.5f}%".format(reg.score(X_test, y_test)*100))

Evaluation result:

MSE = 672.7535572356306
RMSE = 25.937493272011285
MAE = 7.482475525831152
R2 score = 0.9183171750707861
Decision Tree Regressor Accuracy, 93.20767%

That’s it!

For predicting fire radiative power in Australia based on the MODIS data, we have completed the whole development process with useful tools. I hope you found something helpful. Thanks for reading!
