Posts Data Visualization with NOAA Climate Data Set
Post
Cancel

Data Visualization with NOAA Climate Data Set

In this assignment, we will be working with the NOAA climate data set to create some interesting visualizations.

\(\S 1.\) Database

There are three tables that we need to create within a database which are stations, countries, and temperatures. In order to generate a database, we need the help from sqlite3 which is a module that allows us to conveniently create and query databases.

1
2
3
4
import sqlite3

# make a connection
conn = sqlite3.connect("noaa.db")

First, let’s access the stations data set with pandas.

1
2
3
4
5
import pandas as pd

url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/noaa-ghcn/station-metadata.csv"
stations = pd.read_csv(url)
stations.head()
IDLATITUDELONGITUDESTNELEVNAME
0ACW0001160457.766711.866718.0SAVE
1AE00004119625.333055.517034.0SHARJAH_INTER_AIRP
2AEM0004118425.617055.933031.0RAS_AL_KHAIMAH_INTE
3AEM0004119425.255055.364010.4DUBAI_INTL
4AEM0004121624.430054.47003.0ABU_DHABI_BATEEN_AIR

Now, we can add this table to our database through the connection conn that we defined earlier.

1
stations.to_sql("stations", conn, if_exists = "replace", index = False)

We proceed similarly with the country data set.

1
2
3
4
5
url = "https://raw.githubusercontent.com/mysociety/gaze/master/data/fips-10-4-to-iso-country-codes.csv"
countries = pd.read_csv(url)
# change the column "Name" for easier reference later on
countries = countries.rename(columns = {"Name": "Country"})
countries.head()
FIPS 10-4ISO 3166Country
0AFAFAfghanistan
1AX-Akrotiri
2ALALAlbania
3AGDZAlgeria
4AQASAmerican Samoa

To avoid potential issues with the spacing in the column names, let’s rename them in a more friendly and convenient format to work with.

1
2
3
4
5
# we don't need ISO column, so we will drop it
countries = countries.drop(["ISO 3166"], axis = 1)
# rename the column with name with space in it
countries = countries.rename(columns = {"FIPS 10-4": "FIPS"})
countries.head()
FIPSCountry
0AFAfghanistan
1AXAkrotiri
2ALAlbania
3AGAlgeria
4AQAmerican Samoa

We add the table to the database as usual.

1
countries.to_sql("countries", conn, if_exists = "replace", index = False)

For the temperature data set, we will load it from our local directory directly instead of accessing it through an url as the data from the source is already splitted/categorized into decades.

1
2
temperatures = pd.read_csv("temps.csv")
temperatures.head()
IDYearVALUE1VALUE2VALUE3VALUE4VALUE5VALUE6VALUE7VALUE8VALUE9VALUE10VALUE11VALUE12
0ACW000116041961-89.0236.0472.0773.01128.01599.01570.01481.01413.01174.0510.0-39.0
1ACW000116041962113.085.0-154.0635.0908.01381.01510.01393.01163.0994.0323.0-126.0
2ACW000116041963-713.0-553.0-99.0541.01224.01627.01620.01596.01332.0940.0566.0-108.0
3ACW00011604196462.0-85.055.0738.01219.01442.01506.01557.01221.0788.0546.0112.0
4ACW00011604196544.0-105.038.0590.0987.01500.01487.01477.01377.0974.031.0-178.0

At this point, there are some data manipulation that we have to do so that the querying process is easier to deal with later on.

1
2
3
4
5
6
7
8
# set index
temperatures = temperatures.set_index(keys = ["ID", "Year"])
# stack data based on months
temperatures = temperatures.stack()
# reset index
temperatures = temperatures.reset_index()
# let's take a look
temperatures.head()
IDYearlevel_20
0ACW000116041961VALUE1-89.0
1ACW000116041961VALUE2236.0
2ACW000116041961VALUE3472.0
3ACW000116041961VALUE4773.0
4ACW000116041961VALUE51128.0

Now, looking at the table above, we can immediately observe that the name of the “month” and “temperature” columns do not make much sense, so let’s change them and also the value within the month column as well. In addition, since the temperature is 100x of its unit degree Celsius, we divide the column accordingly.

1
2
3
4
5
6
7
8
9
# rename some of the columns
temperatures = temperatures.rename(columns = {"level_2": "Month", 0: "Temp"})
# change VALUE (of month) to integer represent month in a year
temperatures["Month"] = temperatures["Month"].str[5:].astype(int)
# add a FIPS column for querying database later on
temperatures["FIPS"] = temperatures["ID"].str[:2]
# convert temperature to (C)
temperatures["Temp"] = temperatures["Temp"] / 100
temperatures.head()
IDYearMonthTempFIPS
0ACW0001160419611-0.89AC
1ACW00011604196122.36AC
2ACW00011604196134.72AC
3ACW00011604196147.73AC
4ACW000116041961511.28AC

Things look pretty good here! We can now add this data set to our database as before.

1
temperatures.to_sql("temperatures", conn, if_exists = "replace", index = False)

To verify whether our database does indeed contain the three data sets, we use a sql cursor, .cursor, to interact with it.

1
2
3
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
1
[('stations',), ('countries',), ('temperatures',)]

And that’s what we exactly want to be outputted. Finally, as a safeguard, we close the connection to the database to avoid any undesirable/unexpected situations such as leak, etc.

1
conn.close()

Let’s now move on to the next part of this post…

\(\S 2.\) Query the Database

In this section, we will define a function in which it accepts some certain information that we need to query on the data base and return the corresponding Pandas dataframe with data from the three tables we have in the database.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
def query_climate_database(country, year_begin, year_end, month):
    """
    FUNCTION
    --------
    Generate a dataframe which provides us with information of stations,
    latitude, longitude, and average temperatures from a set of conditions
    set by the user
    
    PARAMETERS
    ----------
    country   : name of a country (string)
    year_begin: starting year (int)
    year_end  : ending year (int)
    month     : the only month included in the dataframe
    
    RETURN
    ------
    A dataframe filled with info specified by the user along with the
    corresponding data/columns (station, latitude, longitude, country,
    year, month, temperature)
    """
    
    cmd = \
    f"""
    SELECT S.name, S.latitude, S.longitude, C.country, T.year, T.month, T.temp
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    LEFT JOIN countries C ON T.fips = C.fips
    WHERE C.country = ?
        AND T.year BETWEEN ? AND ?
        AND month = ?
    """
    # remember to open the connection before accessing the database
    conn = sqlite3.connect("noaa.db")
    # query the database
    df = pd.read_sql_query(cmd, conn, params=(country, year_begin, year_end, month))
    # close the connection
    conn.close()
    return df

For example, let’s say we want to take a look at India from 1980 to 2020 in the month of January.

1
2
3
4
query_climate_database(country = 'India', 
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)
NAMELATITUDELONGITUDECountryYearMonthTemp
0PBO_ANANTAPUR14.58377.633India1980123.48
1PBO_ANANTAPUR14.58377.633India1981124.57
2PBO_ANANTAPUR14.58377.633India1982124.19
3PBO_ANANTAPUR14.58377.633India1983123.51
4PBO_ANANTAPUR14.58377.633India1984124.81
........................
3147DARJEELING27.05088.270India198315.10
3148DARJEELING27.05088.270India198616.90
3149DARJEELING27.05088.270India199418.10
3150DARJEELING27.05088.270India199515.60
3151DARJEELING27.05088.270India199715.70

3152 rows × 7 columns

Fantastic! Our function does work as we expected.

\(\S 3.\) Geographic Scatter Plot

For this section, we will define a function which answers the following question:

How does the average yearly change in temperature vary within a given country?

First, we have to get a sense of how to compute the average yearly change in temperature of each station. Recall from PIC16A that there is a tool oftentimes used for such a task which is linear regression. Mathematically speaking, the coefficient of the line of best fit is in fact an estimate of the yearly temperature change.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
from sklearn.linear_model import LinearRegression
import numpy as np

def coef(df):
    """
    Estimate the coefficient, which is also the estimate of the average
    yearly change in temperature.
    -------------------------------
    Input: a pandas dataframe
    Ouput: the linear coefficient
    """
    
    # X needs to be a pandas dataframe
    X = df[["Year"]]
    # y needs to be a pandas series
    y = df["Temp"]
    LR = LinearRegression()
    LR.fit(X, y)
    return np.round(LR.coef_[0], 3)

To visually answer the main question, we will need the help from plotly to produce an interactive scatter plot which showcases the the yearly change of temperature at a specific time for each station within a given country.

A demonstration of coding documentation and the use of <br /> in long title that I suggested to my classmate.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
from plotly import express as px

def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
    """
    FUNCTION
    --------
    Generate an interactive geographic scatter plot that shows the
    yearly change in temperature at a given time and contry specified
    by the user
    
    PARAMETERS
    ----------
    country   : name of the country (string)
    year_begin: starting year (int)
    year_end  : ending year(int)
    month     : month (int)
    min_obs   : minimum required number of years for a station (int)
    
    RETURN
    ------
    the produced interactive scatter plot
    """
    
    # collect the data
    df = query_climate_database(country, year_begin, year_end, month)
    # filter data that satisfies the minimum observation requirements
    df = df.groupby(["NAME"]).filter(lambda x: len(x) >= min_obs)
    # compute the average yearly change in temperature
    coefs = df.groupby(["NAME", "LATITUDE", "LONGITUDE", "Month"]).apply(coef)
    coefs = coefs.reset_index()
    # rename the column with info about the yearly rate
    coefs = coefs.rename(columns = {0: "Estimated Yearly<br>Change (C)"})
    # time to plot
    fig = px.scatter_mapbox(coefs,
                            lat = "LATITUDE",
                            lon = "LONGITUDE",
                            hover_name = "NAME",
                            title = "Estimates of Yearly Average Change of Temperature<br>in "
                                    + country + " stations from " + str(year_begin)
                                    + "-" + str(year_end) + " in " + month
                            color = "Estimated Yearly<br>Change (C)",
                            **kwargs)
    fig.update_layout(margin={"r":0,"t":55,"l":0,"b":0})
    return fig
1
2
3
4
5
6
7
8
9
# choose a colormap
color_map_1 = px.colors.sequential.Turbo

fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2.4,
                                   mapbox_style = "carto-positron",
                                   color_continuous_scale = color_map_1)
fig.show()


Such a cool map we have produced! Observe that a majority of stations in India over this time period have a slight increase/decrease in yearly temperature. Nevertheless, there are still a small number of cases where the yearly change is very alarming, i.e. the temperature changes of these stations are significantly higher than their neighborhood. Now, let’s do some even cooler visualization while exploring these data sets.

\(\S 4.\) Geographic Coordinates with 3D Scatter Plots

For this section of the post, I will try to construct a 3d scatter plot in an attempt to address the following question:

How does temperature vary with respect to latitude and longitude (geographic coordinates) in a given year?

As this is a rather straightforward question to answer, we can certainly define a simple query function to get the desired data.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
def query_climate_database_3d(year, lat_min, lat_max, lon_min, lon_max):
    """
    Generate a pandas dataframe by querying the database with sql
    restricted to a certain set of conditions set by the user.
    ------------
    Input
    year   : year (int)
    lat_min: minimum latitude (float)
    lat_max: maximum latitude (float)
    lon_min: minimum longitude (float)
    lon_max: maximum longitude (float)
    
    Output
    A pandas dataframe that satisfies those conditions
    """
    cmd = \
    f"""
    SELECT S.latitude, S.longitude, T.temp, T.month
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    WHERE T.year = ?
        AND S.latitude BETWEEN ? AND ?
        AND S.longitude BETWEEN ? AND ?
    """
    # open the connection
    conn = sqlite3.connect("noaa.db")
    # query the database
    df = pd.read_sql_query(cmd, conn, 
                           params=(year, lat_min, lat_max, lon_min, lon_max))
    # close the connection
    conn.close()
    return df

We’re ready now to get started with implementing the plotting function.

I want to say thank to an anonymous friend for suggesting to add docstring to this function!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
def geo_temp_3d_plot(year, lat_min, lat_max, lon_min, lon_max, **kwargs):
    """
    FUNCTION
    --------
    Generate a 3D interactive plot demonstrating the relationship
    between geographic coordinates and temperatures with respect to
    each month in a year
    
    PARAMETERS
    ----------
    year    : desired year (int)
    lat_min : minimum latitude (float)
    lat_max : maximum latitude (float)
    lon_min : minimum longitude (float)
    lon_max : maximum longitude (float)
    **kwargs: additional arguments in px.scatter_3d()
    
    RETURN
    ------
    the interactive 3d figure generated by px.scatter_3d()
    """
    geo = query_climate_database_3d(year, lat_min, lat_max, lon_min, lon_max)
    fig = px.scatter_3d(geo,
                        x = "LATITUDE",
                        y = "LONGITUDE",
                        z = "Month",
                        labels = {"Temp": "Temperature (C)"},
                        color = "Temp",
                        **kwargs,)
    fig.update_layout(margin={"r":0,"t":25,"l":0,"b":15},
                      title = {
                          "text": "Temperatures with respect to Geographic Coordinates in "
                          + str(year),
                          "y": .95,
                          "x": .4,
                          "xanchor": "center",
                          "yanchor": "top",
                      },
                      title_font_size = 20,
                      title_font_color = "DarkBlue")
    return fig

Let’s test our function.

1
2
3
4
5
6
color_map_2 = px.colors.diverging.Portland
fig = geo_temp_3d_plot(2000, -8, 8, -8, 8,
                 opacity = .7,
                 height = 530,
                 color_continuous_scale = color_map_2)
fig.show()


From the plot, it’s intuitive that the temperature is high from Jan to July and has a tendency to decrease rapidly onward from July to Dec which is certainly aligned with our common sense. With respect to the geographic coordinates, the lower the latitude the lower the temperature appears to be. In addition, it seems that in this specific plot, longitude does not have an observable effect with regard to temperature, though if we look closely enough, it seems like the temperature is cooler toward the two ends of the spectrum. For the last visualization of this blog post, let’s change gear a bit and meet our old friend seaborn.

\(\S 5.\) Time-Series Plot of a Station

For this section, we will utilize the power of relation plot and line plot from seaborn to tackle the following question

Given any station in the world, how does the average temperature there range in a certain time period?

Let’s now construct a function that would get us the data that we want.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
def query_station(country, station, year_begin, year_end):
    """
    Generate a panadas dataframe of a specified station for a specified
    period of time
    --------------
    Input
    country   : name of a country (string)
    station   : name of a station within that country (string)
    year_begin: beginning year (int)
    year_end  : ending year (int)
    
    Output
    A pandas dataframe that satisfies those conditions
    """
    b = pd.DataFrame()
    # open the connection
    conn = sqlite3.connect("noaa.db")
    # 12 months in a year
    for i in range(1, 13):
        a = query_climate_database(country, year_begin, year_end, i)
        b = pd.concat([b, a])
    b = b[b["NAME"] == station].sort_values(by=["Year", "Month"], ignore_index=True)
    # close the connection
    conn.close()
    return b

Note: Because of some technical errors with accessing the database using a newly defined function (for the third time), I cannot implement a new query in this section. Instead, I will be recycling query_climate_database() from the first plot and add some nuances to the function to serve the purpose.

At this point, we’re fully prepared to construct a plotting function which would address the stated question.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
import seaborn as sns
sns.set_theme(style="dark")
def mult_line_plot_station(country, station, year_begin, year_end):
    """
    FUNCTION
    --------
    Construct a group of plots of a station's temperatures within
    a given period of time
    
    PARAMETERS
    ----------
    country   : name of a country to inspect (string)
    station   : name of a station in that country (string)
    year_begin: beginning year (int)
    year_end  : ending year (int)
    
    RETURN
    ------
    A facet grids that showcases temperatures of the station
    within the given time period
    """
    
    # get the data
    s = query_station(country, station, year_begin, year_end)
    # plot the relational plot for each year
    g = sns.relplot(data=s, x="Month", y="Temp", col="Year", hue="Year",
                    kind="line", palette="husl", linewidth=2.5, zorder=5,
                    col_wrap=3, height=2.2, aspect=1.6, legend=False)
    
    # plot the the other year in the background for comparison
    for year, ax in g.axes_dict.items():
        # denote year (bold) at the top right of the plot
        ax.text(.81, .9, year, transform=ax.transAxes, fontweight="bold")
        # Plot every year's time series in the background
        sns.lineplot(data=s, x="Month", y="Temp", units="Year",
                     estimator=None, color=".75", linewidth=1, ax=ax)

    # set x-tick labels
    ax.set_xticks(ax.get_xticks()[::2])

    # add title, axis label and annotation
    g.fig.suptitle("Temperatures of " + station + " station"
                   + " from " + str(year_begin) + " to " + str(year_end))
    g.set_titles("")
    g.set_axis_labels("Month", "Temperature (C)")
    g.tight_layout()

Now, let’s check a station in Vietnam named TAN_SON_HOA from 1999 to 2010.

1
mult_line_plot_station("Vietnam", "TAN_SON_HOA", 1999, 2010)

png

Thank you Philip for helping me realize that I need to add more explanations on the meaning of this plot.

The series of graphs appears to fluctuate quite erratically, but a more thorough inspection reveals that the temperature, in general, has increased over time. Especially, if we observe more closely at the winter season, the temperature looks like it follows an indicative trend in which it increases quite rapidly. This is a rather solid indication of the realness and severity of climate change, which should be taken seriously by every single one of us.
This post is licensed under CC BY 4.0 by the author.