ECHO-COVID19

Last updated: May 13, 2020

Tracking EPA's COVID-19 non-enforcement policy

In late March 2020, EPA released a memo announcing that it would not penalize regulated industries that fail to meet their monitoring and reporting requirements due to COVID-19. Specifically, EPA said that it:

"is not seeking penalties for noncompliance only in circumstances that involve routine monitoring and reporting requirements, if, on a case-by-case basis, EPA agrees that such noncompliance was caused by the COVID-19 pandemic."

This may have a number of public and environmental health impacts if facilities respond by increasing their emissions and discharges. See our response to EPA's memo here: https://envirodatagov.org/epas-covid-19-leniency-is-a-free-pass-to-pollute/

What is the effect of EPA's non-enforcement of Clean Air Act and Clean Water Act permits? Using this notebook, you can track how facilities' releases of air and water hazards, as well as their monitoring and reporting, have changed in your state over the past few months compared to previous years.

Essentially, there are three scenarios we may see playing out:

Monitoring and reporting violations

    1. Facilities that do not report (we can track this)... but do still meet their permit limits (we can't know this, precisely because they didn't report)
    2. Facilities that do not report (we can track this)... and actually exceed their limits (we can't know this, precisely because they didn't report)

Environmental violations

    1. Facilities that do meet their reporting obligations BUT they report having exceeded their permitted limits
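As a toy illustration (made-up sample data, not drawn from ECHO), the three scenarios can be expressed as a simple classification:

```python
import pandas as pd

# Hypothetical sample: each row is a facility's reporting status for one period.
facilities = pd.DataFrame({
    'reported': [False, False, True, True],
    'exceeded': [None, None, True, False],  # unknowable when the facility didn't report
})

def scenario(row):
    if not row['reported']:
        return 'monitoring/reporting violation (exceedance unknowable)'
    return 'environmental violation' if row['exceeded'] else 'in compliance'

facilities['scenario'] = facilities.apply(scenario, axis=1)
```

The first two scenarios collapse into one observable category, which is exactly why non-reporting is hard to assess: the data needed to distinguish them is the data that was never submitted.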

Organization of this notebook:

  • Nationwide trends
    • Air emissions (from major sources?) (TBD)
    • Water discharges from major sources
    • Water quality monitoring and reporting violations
  • Local trends: zoom in on a state/zip of interest
    • Air emissions (from major sources?) (TBD)
    • Water discharges from major sources (TBD)
    • Water quality monitoring and reporting violations (TBD)

How to Run

  • If you click on a gray code cell, a little “play button” arrow appears on the left. Clicking it runs the code in that cell (“running a cell”). The button will animate; when the animation stops, the cell has finished running.
  • You may get a warning that the notebook was not authored by Google. We know, we authored it! It’s okay: click “Run Anyway” to continue.
  • It is important to run cells in order because they depend on each other.
  • However, besides running the first cell below ("Setup"), you can run the notebook section by section (e.g., just the sections you are interested in).
  • Some cells, especially in the "Select your state" section, will create a dropdown menu after you run them. Be sure to make a selection before running the next cell.
  • Other cells will simply print information when you run them.
  • Run all of the cells in a Notebook to make a complete report. Please feel free to look at and learn about each result as you create it!

Setup

Here we load some helper code to get us going.

In [56]:
# Import libraries
import urllib.parse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import requests
import csv
import datetime
import folium

Air emissions monitoring

First: Are facilities even doing stack tests? Stack tests involve measuring the volume of pollutants coming out of the facility's "smokestack".

The following cell will grab EPA data on facility stack tests for every April on record (up to 50 years ago). Some pollutant releases may be seasonal, so by looking only at Aprils, we can account for this variation and ensure an apples-to-apples comparison.

In [57]:
# Query stack tests whose end date falls in any April (dates are stored as MM/DD/YYYY)
sql = "select * from `ICIS-AIR_STACK_TESTS` where ACTUAL_END_DATE like '04/%'"
url='http://apps.tlt.stonybrook.edu/echoepa/?query='
data_location=url+urllib.parse.quote(sql)

data = pd.read_csv(data_location,encoding='iso-8859-1',header = 0)
data.set_index("pgm_sys_id", inplace=True)
data
Out[57]:
ACTIVITY_ID COMP_MONITOR_TYPE_CODE COMP_MONITOR_TYPE_DESC STATE_EPA_FLAG ACTUAL_END_DATE POLLUTANT_CODES POLLUTANT_DESCS AIR_STACK_TEST_STATUS_CODE AIR_STACK_TEST_STATUS_DESC
pgm_sys_id
CT0000000900300005 3400331189 CST Stack Test S 04/26/2007 Mercury NaN PSS Pass
CT0000000900300005 3400331252 CST Stack Test S 04/28/2009 TOTAL PARTICULATE MATTER NaN PSS Pass
CT0000000900300005 3400331300 CST Stack Test S 04/28/2009 NITROGEN OXIDES NO2 NaN PSS Pass
CT0000000900300005 3400331331 CST Stack Test S 04/28/2010 Mercury NaN PSS Pass
CT0000000900300005 3400331352 CST Stack Test S 04/28/2011 Mercury NaN PSS Pass
... ... ... ... ... ... ... ... ... ...
080000004904700106 3600949251 CST Stack Test E 04/28/2015 NaN NaN PSS Pass
PA000238561 3601939171 CST Stack Test S 04/26/2000 NaN NaN PSS Pass
IA0000001901700030 3600399426 CST Stack Test S 04/29/2015 NaN NaN PSS Pass
IL000097190AAE 3601478216 CST Stack Test S 04/30/2018 NaN NaN PSS Pass
AK0000000229000002 3600827835 CST Stack Test S 04/02/2015 NaN NaN PSS Pass

36809 rows × 9 columns

Now we'll chart this data.

The height of each bar indicates how many tests there were, while the red line shows the average across all Aprils on record and the yellow line shows the average for the past three Aprils (2017–2019).

In [58]:
data['ACTUAL_END_DATE'] = pd.to_datetime(data['ACTUAL_END_DATE'], format="%m/%d/%Y") #format

stacks = data.groupby(['ACTUAL_END_DATE'])['STATE_EPA_FLAG'].count() 
# = total number of pollutant stack tests. May want to summarize by facility instead. 
stacks = stacks.resample('M').sum() #resample to a monthly basis  
stacks = stacks.loc[(stacks.index.month == 4)] # Filter back to just Aprils
stacks = pd.DataFrame(stacks)
stacks = stacks.rename(columns={'STATE_EPA_FLAG': "Number of stack tests"})
stacks.index = stacks.index.strftime('%Y-%m') # makes the x axis (date) prettier

ax = stacks.plot(kind='bar', title = "Past Aprils—# of pollutants stack-tested", figsize=(20, 10), fontsize=16)
ax

#label trendline
trend=stacks['Number of stack tests'].mean()
ax.axhline(y=trend, color='r', linestyle='-')

#label past 3 Aprils trend (2017, 2018, 2019)
trend_month=pd.concat([stacks.loc["2017-04"],stacks.loc["2018-04"],stacks.loc["2019-04"]])
trend_month=trend_month['Number of stack tests'].mean()
ax.axhline(y=trend_month, color='y', linestyle='-')
Out[58]:
<matplotlib.lines.Line2D at 0x7fe6de6755b0>
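The charting cells below all repeat this same pattern: filter to Aprils, total by year, then compare the long-run average against the last three Aprils. A helper along these lines could factor it out (a sketch, not part of the original notebook):

```python
import pandas as pd

def april_trends(counts, recent_years=(2017, 2018, 2019)):
    """Given a Series of event counts indexed by datetime, return
    (per-April totals, mean over all Aprils, mean over recent Aprils)."""
    aprils = counts[counts.index.month == 4]          # keep only April events
    totals = aprils.groupby(aprils.index.year).sum()  # one total per April
    overall_mean = totals.mean()                      # the red trendline
    recent_mean = totals[totals.index.isin(recent_years)].mean()  # the yellow trendline
    return totals, overall_mean, recent_mean
```

The cells below keep the original inline version so each one stays self-contained and runnable on its own.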

What are facilities releasing into the air?

This has ramifications for COVID-19, a respiratory illness that air pollution is known to aggravate.

First, we'll get this data!

In [59]:
# Query High Priority Violations whose "day zero" falls in any April (dates are stored as MM-DD-YYYY)
sql = "select * from `ICIS-AIR_VIOLATION_HISTORY` where HPV_DAYZERO_DATE like '04-%'"
url='http://apps.tlt.stonybrook.edu/echoepa/?query='
data_location=url+urllib.parse.quote(sql)

data = pd.read_csv(data_location,encoding='iso-8859-1',header = 0)
data.set_index("pgm_sys_id", inplace=True)
data
Out[59]:
ACTIVITY_ID AGENCY_TYPE_DESC STATE_CODE AIR_LCON_CODE COMP_DETERMINATION_UID ENF_RESPONSE_POLICY_CODE PROGRAM_CODES PROGRAM_DESCS POLLUTANT_CODES POLLUTANT_DESCS EARLIEST_FRV_DETERM_DATE HPV_DAYZERO_DATE HPV_RESOLVED_DATE
pgm_sys_id
IA0000001901900052 3400381811 U.S. EPA NaN NaN 07000F0000190190005200026 HPV CAATVP Title V Permits 300000329 FACIL NaN 04-16-2003 07-24-2003
IL000031806AAY 3400337992 U.S. EPA NaN NaN 05000F0000170310238400015 HPV CAANESH National Emission Standards for Hazardous Air ... 300000183 Benzene NaN 04-26-1999 11-30-2004
IL000031600ASD 3400337995 U.S. EPA NaN NaN 05000F0000170310124900094 HPV CAATVP Title V Permits 300000243 VOLATILE ORGANIC COMPOUNDS (VOCS) NaN 04-16-2001 07-29-2003
IL000031096AKZ 3400338004 U.S. EPA NaN NaN 05000F0000170310048100018 HPV CAATVP Title V Permits 300000243 VOLATILE ORGANIC COMPOUNDS (VOCS) NaN 04-30-2001 03-31-2004
IL000119813AAI 3400338016 State IL NaN IL000A0000171190015300047 HPV CAASIP State Implementation Plan for National Primary... 300000319 PARTICULATE MATTER < 10 UM NaN 04-28-1989 06-09-1993
... ... ... ... ... ... ... ... ... ... ... ... ... ...
TX0000004820102042 3601846857 State TX NaN TX000A0970639792019114001 HPV CAATVP Title V Permits 300000329 FACIL 04-14-2019 04-14-2019 NaN
CABAA424710 3602050949 Local CA BAA CABAAA78846 HPV CAATVP Title V Permits NaN NaN 04-12-2016 04-12-2016 01-31-2018
CABAA424710 3602050960 Local CA BAA CABAAA78849 HPV CAATVP Title V Permits NaN NaN 04-12-2016 04-12-2016 01-31-2018
TX0000004820101886 3602092384 State TX NaN TX000A0729450122006130002 HPV CAATVP Title V Permits 300000329 FACIL 04-28-2006 04-28-2006 04-24-2007
TX0000004820101886 3602092389 State TX NaN TX000A0787300092008183002 HPV CAATVP Title V Permits 300000329 FACIL 04-24-2008 04-24-2008 03-03-2010

3428 rows × 13 columns

Let's chart it!

The height of each bar indicates how many emissions violations there have been, while the red line shows the average across all Aprils on record and the yellow line shows the average for the past three Aprils (2017–2019).

In [60]:
data['HPV_DAYZERO_DATE'] = pd.to_datetime(data['HPV_DAYZERO_DATE'], format="%m-%d-%Y") #format

airviols = data.groupby(['HPV_DAYZERO_DATE'])['ENF_RESPONSE_POLICY_CODE'].count() 
# = total number of violations. May want to summarize by facility instead. 
airviols = airviols.resample('M').sum() #resample to a monthly basis  
airviols = airviols.loc[(airviols.index.month == 4)] # Filter back to just Aprils
airviols = pd.DataFrame(airviols)
airviols = airviols.rename(columns={'ENF_RESPONSE_POLICY_CODE': "Number of violations"})
airviols.index = airviols.index.strftime('%Y-%m') # makes the x axis (date) prettier

ax = airviols.plot(kind='bar', title = "Past Aprils—# of Air Violations", figsize=(20, 10), fontsize=16)
ax

#label trendline
trend=airviols['Number of violations'].mean()
ax.axhline(y=trend, color='r', linestyle='-')

#label past 3 Aprils trend (2017, 2018, 2019)
trend_month=pd.concat([airviols.loc["2017-04"],airviols.loc["2018-04"],airviols.loc["2019-04"]])
trend_month=trend_month['Number of violations'].mean()
ax.axhline(y=trend_month, color='y', linestyle='-')
Out[60]:
<matplotlib.lines.Line2D at 0x7fe6d74c1ca0>

Let's break it down by type of pollutant. We'll focus on two of what EPA calls "criteria pollutants": pollutants with such an impact on human health that the agency regulates how much of them can be in the atmosphere at any given time and place. The two are particulate matter, which is known to affect the circulatory and nervous systems, and sulfur dioxide, which exacerbates asthma.

In [61]:
data['HPV_DAYZERO_DATE'] = pd.to_datetime(data['HPV_DAYZERO_DATE'], format="%m-%d-%Y") #format
cps = data

cps = cps[(cps['POLLUTANT_DESCS'].astype(str).str.contains('Particulate')) | (cps['POLLUTANT_DESCS'].astype(str).str.contains('Sulfur'))]
cps_map = cps # for mapping later...

cps = cps.groupby(['HPV_DAYZERO_DATE'])['POLLUTANT_DESCS'].count() 
# = total number of violations. May want to summarize by facility instead. 

cps = cps.resample('M').sum() #resample to a monthly basis  
cps = cps.loc[(cps.index.month == 4)] # Filter back to just Aprils
cps = pd.DataFrame(cps)
cps = cps.rename(columns={'POLLUTANT_DESCS': "Number of violations"})
cps.index = cps.index.strftime('%Y-%m') # makes the x axis (date) prettier

ax = cps.plot(kind='bar', title = "Past Aprils—# of Violations Related to Particulate Matter and/or Sulfur Dioxide", figsize=(20, 10), fontsize=16)
ax

#label trendline
trend=cps['Number of violations'].mean()
ax.axhline(y=trend, color='r', linestyle='-')

#label past 3 Aprils trend (2017, 2018, 2019)
trend_month=pd.concat([cps.loc["2017-04"],cps.loc["2018-04"],cps.loc["2019-04"]])
trend_month=trend_month['Number of violations'].mean()
ax.axhline(y=trend_month, color='y', linestyle='-')
Out[61]:
<matplotlib.lines.Line2D at 0x7fe6daae8d00>

Where are these facilities that exceeded their PM and SO2 permits last month?

Even if, on the whole, there are fewer exceedances, the places that are emitting more pollutants are important to track. Their neighbors are suffering more.

In [62]:
latest = cps_map[(cps_map["HPV_DAYZERO_DATE"] >= '2020-04-01') & (cps_map["HPV_DAYZERO_DATE"] < '2020-05-01')]

if (len(latest.index)>0):
    #get facility information from ECHO
    sql = "select FAC_NAME, AIR_IDS, FAC_LAT, FAC_LONG, CAA_QTRS_WITH_NC" + \
        " from ECHO_EXPORTER where AIR_FLAG = 'Y' "
    url='http://apps.tlt.stonybrook.edu/echoepa/?query='
    data_location=url+urllib.parse.quote(sql)
    echo_data = pd.read_csv(data_location,encoding='iso-8859-1',header = 0)
    echo_data.set_index( 'AIR_IDS', inplace=True )
    
    #merge echo and npdes data
    latest = latest.join(echo_data)
    
else:
    print("Actually, there were no PM or SO2 violations for April 2020")

Make the map!

The map shows all the facilities that reported emitting more than their permitted levels of PM and SO2 in April 2020.

In [63]:
# Filter to remove rows with missing coordinates - both latitude and longitude are needed for mapping
latest = latest[~(np.isnan(latest["FAC_LAT"])) & ~(np.isnan(latest["FAC_LONG"]))]

# Generate a scale by which we can classify facilities by quarters in non-compliance, and map them accordingly
latest["quartile"]=pd.qcut(latest["CAA_QTRS_WITH_NC"], 4, labels=False) # quartiles
radii={0:6, 1:10, 2:14, 3: 20}

def mapper(df):
    # Initialize the map
    m = folium.Map(
        location = [df.mean()["FAC_LAT"], df.mean()["FAC_LONG"]],
        zoom_start = 8
    )

    # Add a clickable marker for each facility
    for index, row in df.iterrows():
        folium.CircleMarker(
            location = [row["FAC_LAT"], row["FAC_LONG"]],
            popup = row["FAC_NAME"] + "\n Quarters in Non-Compliance over the past three years: "+str(row["CAA_QTRS_WITH_NC"]),
            radius = radii[row["quartile"]],
            color = "black",
            weight = 1,
            fill_color = "orange",
            fill_opacity= .4
        ).add_to(m)

    bounds = m.get_bounds()
    m.fit_bounds(bounds)

    # Show the map
    return m

map_of_facilities = mapper(latest)
map_of_facilities
Out[63]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Water pollutant discharges

NOTE: Because there are so many facilities that discharge into waters of the US, there's a lot of data! The following cell may take a little while to run.

In [64]:
# Query effluent violations with exceedances, for monitoring periods ending in April
sql = "select NPDES_ID, EXCEEDENCE_PCT, MONITORING_PERIOD_END_DATE, PARAMETER_DESC" + \
    " from NPDES_EFF_VIOLATIONS where EXCEEDENCE_PCT > 0 and MONITORING_PERIOD_END_DATE like '04/%'"
# To filter to one state, add: and NPDES_ID like '" + my_state + "%'

url='http://apps.tlt.stonybrook.edu/echoepa/?query='
data_location=url+urllib.parse.quote(sql)

data = pd.read_csv(data_location,encoding='iso-8859-1',header = 0)
data.set_index("NPDES_ID", inplace=True)
exceeds = data 
exceeds
Out[64]:
EXCEEDENCE_PCT MONITORING_PERIOD_END_DATE PARAMETER_DESC
NPDES_ID
HI0110086 23 04/30/1993 Solids, total suspended
LA0067784 60 04/30/1997 Coliform, fecal general
LA0067784 60 04/30/1997 Coliform, fecal general
ID0020311 49 04/30/2005 Solids, total suspended
PR0025411 40 04/30/2003 Lead, total [as Pb]
... ... ... ...
KYG640153 27 04/30/2016 Solids, total suspended
PA0027103 10 04/30/2015 BOD, carbonaceous [20 day, 20 C]
OH0054488 80 04/30/2015 BOD, carbonaceous [5 day, 20 C]
OH0121355 63 04/30/2014 Oxygen, dissolved [DO]
MO0047350 25 04/30/2013 Chlorine, total residual

191313 rows × 3 columns

Let's chart this

Are facilities exceeding their permits more this April than previous Aprils? Like with air emissions and monitoring, we need to compare April-April because there is a seasonality to many discharges.

The height of each bar indicates how many pollution permits have been exceeded, while the red line shows the average across all Aprils on record and the yellow line shows the average for the past three Aprils (2017–2019).

In [65]:
# First, let's look at the number of pollutant permits that were exceeded
exceeds['MONITORING_PERIOD_END_DATE'] = pd.to_datetime(exceeds['MONITORING_PERIOD_END_DATE'], format="%m/%d/%Y") #format

month = exceeds.groupby(['MONITORING_PERIOD_END_DATE'])['EXCEEDENCE_PCT'].count() # total number of exceedances per monitoring-period end date
month = month.resample('M').sum() #resample to a monthly basis
month = month.loc[(month.index.month == 4)] # Filter back to just Aprils
month = pd.DataFrame(month)
month = month.rename(columns={'EXCEEDENCE_PCT': "Number of pollution permits exceeded"})
month.index = month.index.strftime('%Y-%m') # makes the x axis (date) prettier

ax = month.plot(kind='bar', title = "Past Aprils—# of Exceedances", figsize=(20, 10), fontsize=16)
ax

#label trendline
trend=month['Number of pollution permits exceeded'].mean()
ax.axhline(y=trend, color='r', linestyle='-')

#label past 3 Aprils trend (2017, 2018, 2019)
trend_month=pd.concat([month.loc["2017-04"],month.loc["2018-04"],month.loc["2019-04"]])
trend_month=trend_month['Number of pollution permits exceeded'].mean()
ax.axhline(y=trend_month, color='y', linestyle='-')
Out[65]:
<matplotlib.lines.Line2D at 0x7fe6d52899d0>

In the next chart, the height of each bar indicates how many facilities have exceeded their permits, while the red line shows the average across all Aprils on record and the yellow line shows the average for the past three Aprils (2017–2019).

In [66]:
# Second, let's look at the number of facilities that exceeded their permits.
# Some facilities have multiple permits for multiple pollutants, so the results here may differ from those above

fac=exceeds.reset_index()
fac=fac.set_index(['MONITORING_PERIOD_END_DATE'])
facilities = fac.groupby(['MONITORING_PERIOD_END_DATE']).agg({"NPDES_ID": "nunique"})
facilities = facilities.resample('M').sum()
facilities = facilities.loc[(facilities.index.month == 4)] # Filter back to just Aprils
facilities = facilities.rename(columns={'NPDES_ID': "Number of facilities"})
facilities.index = facilities.index.strftime('%Y-%m') # makes the x axis (date) prettier

ax = facilities.plot(kind='bar', title = "Past Aprils—# of Facilities Exceeding", figsize=(20, 10), fontsize=16)
ax

#label trendline
trend=facilities['Number of facilities'].mean()
ax.axhline(y=trend, color='r', linestyle='-')

#label past 3 Aprils trend (2017, 2018, 2019)
trend_month=pd.concat([facilities.loc["2017-04"],facilities.loc["2018-04"],facilities.loc["2019-04"]])
trend_month=trend_month['Number of facilities'].mean()
ax.axhline(y=trend_month, color='y', linestyle='-')
Out[66]:
<matplotlib.lines.Line2D at 0x7fe6df68e730>

The height of each bar indicates the median amount by which permits were exceeded, while the red line shows the average of these medians across all Aprils on record and the yellow line shows the average for the past three Aprils (2017–2019).

In [67]:
# Third, let's look at how MUCH these facilities exceeded their permits.
# We'll calculate the median level (in percent terms) permits were exceeded by.

med = exceeds.reset_index()
med = med.set_index(['MONITORING_PERIOD_END_DATE'])
med = med[['EXCEEDENCE_PCT']].resample("M").median() # keep only the numeric column before taking the median
med = med.loc[(med.index.month == 4)]
med = med.rename(columns={'EXCEEDENCE_PCT': "Median exceedance (% above permitted level)"})
med.index = med.index.strftime('%Y-%m') # makes the x axis (date) prettier

ax = med.plot(kind='bar', title = "Past Aprils—MEDIAN % EXCEEDANCE", figsize=(20, 10), fontsize=16)
ax

#label trendline
trend=med['Median exceedance (% above permitted level)'].mean()
ax.axhline(y=trend, color='r', linestyle='-')

#label past 3 Aprils trend (2017, 2018, 2019)
trend_month=pd.concat([med.loc["2017-04"],med.loc["2018-04"],med.loc["2019-04"]])
trend_month=trend_month['Median exceedance (% above permitted level)'].mean()
ax.axhline(y=trend_month, color='y', linestyle='-')
Out[67]:
<matplotlib.lines.Line2D at 0x7fe6d62c7a90>

Lead in our waters can have serious environmental and public health consequences.

In [68]:
exceeds['MONITORING_PERIOD_END_DATE'] = pd.to_datetime(exceeds['MONITORING_PERIOD_END_DATE'], format="%m/%d/%Y") #format

lead = exceeds[(exceeds['PARAMETER_DESC'].astype(str).str.contains('Lead'))]

lead = lead.groupby(['MONITORING_PERIOD_END_DATE'])['PARAMETER_DESC'].count() # total number of lead exceedances per monitoring-period end date
lead = lead.resample('M').sum() #resample to a monthly basis
lead = lead.loc[(lead.index.month == 4)] # Filter back to just Aprils
lead = pd.DataFrame(lead)
lead = lead.rename(columns={'PARAMETER_DESC': "Number of pollution permits exceeded"})
lead.index = lead.index.strftime('%Y-%m') # makes the x axis (date) prettier

ax = lead.plot(kind='bar', title = "Past Aprils—# of times permits to discharge lead were exceeded", figsize=(20, 10), fontsize=16)
ax

#label trendline
trend=lead['Number of pollution permits exceeded'].mean()
ax.axhline(y=trend, color='r', linestyle='-')

#label past 3 Aprils trend (2017, 2018, 2019)
trend_lead=pd.concat([lead.loc["2017-04"],lead.loc["2018-04"],lead.loc["2019-04"]])
trend_lead=trend_lead['Number of pollution permits exceeded'].mean()
ax.axhline(y=trend_lead, color='y', linestyle='-')
Out[68]:
<matplotlib.lines.Line2D at 0x7fe6d1e52f70>

Water Quality Monitoring and Reporting

First, we'll look at how facilities regulated under the Clean Water Act have altered their required monitoring practices.

Run the code in the cell below, which will query our copy of the ECHO database and pull information on regulated facilities in your state.

Specifically, we'll find records of facilities out of compliance - that is, violating their permits - due to "Non-Receipt of DMR/Schedule Report". DMR stands for Discharge Monitoring Report; these reports are required by the CWA's National Pollutant Discharge Elimination System (NPDES).

Not submitting these reports on schedule can lead to "Reportable Non-Compliance" with NPDES and CWA. According to the EPA, "DMR values not received within 31 days of the DMR form due date result in the generation of a violation code (D80 or D90). ICIS-NPDES identifies these DMR non-receipt violations and automatically creates violation codes for the missing DMR values with monitoring requirements (D80) and missing DMR values with effluent limits (D90). EPA's data sharing policy allows states a 40-day window to report DMR values to EPA's data system; therefore, DMR values reported on time to state agencies and shared with EPA within 40 days do not contribute to permit level noncompliance status."
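For illustration, the D80/D90 distinction described in that policy could be tallied once violation records are in a dataframe. This is a hedged sketch using made-up sample rows; the column actually holding these codes in the ECHO tables may be named differently:

```python
import pandas as pd

# Hypothetical sample rows; real data would come from an ECHO query like those above.
viols = pd.DataFrame({
    'NPDES_ID': ['NM0030317', 'IN0025160', 'NJ0001023'],
    'VIOLATION_CODE': ['D80', 'D90', 'D80'],  # D80 = missing DMR values with monitoring
                                              # requirements; D90 = with effluent limits
})
code_counts = viols['VIOLATION_CODE'].value_counts()
```

In this notebook we instead use the RNC_DETECTION_CODE field, which flags the same non-reporting events.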

In this case, "N" does NOT mean no; it is simply the code for the kind of violation event we're interested in (non-reporting).

In [69]:
sql = "select NPDES_ID, SCHEDULE_DATE, RNC_DETECTION_CODE" + \
    " from NPDES_PS_VIOLATIONS where RNC_DETECTION_CODE = 'N' and " + \
    " SCHEDULE_DATE like '04/%'"
# To filter to one state, add: and NPDES_ID like '" + my_state + "%'
url='http://apps.tlt.stonybrook.edu/echoepa/?query='
data_location=url+urllib.parse.quote(sql)

data = pd.read_csv(data_location,encoding='iso-8859-1',header = 0)
data.set_index("NPDES_ID", inplace=True)
data
Out[69]:
SCHEDULE_DATE RNC_DETECTION_CODE
NPDES_ID
NM0030317 04/01/2006 N
NM0023311 04/01/2001 N
IN0025160 04/01/1996 N
IN0031577 04/01/2000 N
IN0052370 04/01/1999 N
... ... ...
NJG000329 04/03/2018 N
MO0028720 04/01/2019 N
NJ0001023 04/01/2018 N
NJ0001023 04/01/2018 N
MO0098132 04/01/2019 N

3540 rows × 2 columns

Let's organize this information by date

We're curious to track whether non-receipt of DMRs has increased due to COVID, so we have to be able to summarize and order facilities' violations across time.

In [70]:
# The NPDES_IDS in ECHO_EXPORTER can contain multiple ids for a facility. 
data['SCHEDULE_DATE'] = pd.to_datetime(data['SCHEDULE_DATE'], format="%m/%d/%Y")
by_date = data.groupby('SCHEDULE_DATE')[['RNC_DETECTION_CODE']].count()
by_date=by_date.resample('M').sum()
#by_date=by_date.loc['2017-01-01':'2020-05-01'] #filter to relatively recent
by_date = by_date.loc[(by_date.index.month == 4)] # Filter back to just Aprils
by_date
Out[70]:
RNC_DETECTION_CODE
SCHEDULE_DATE
1986-04-30 7
1987-04-30 0
1988-04-30 2
1989-04-30 1
1990-04-30 2
1991-04-30 3
1992-04-30 5
1993-04-30 7
1994-04-30 5
1995-04-30 15
1996-04-30 15
1997-04-30 8
1998-04-30 28
1999-04-30 24
2000-04-30 38
2001-04-30 56
2002-04-30 54
2003-04-30 49
2004-04-30 55
2005-04-30 43
2006-04-30 67
2007-04-30 69
2008-04-30 120
2009-04-30 175
2010-04-30 230
2011-04-30 204
2012-04-30 230
2013-04-30 222
2014-04-30 229
2015-04-30 305
2016-04-30 304
2017-04-30 284
2018-04-30 277
2019-04-30 407

Plot this ^ !!!

It's all well and good to have this table, but it's hard to pick out patterns from tabular data. Let's plot it as a bar chart in order to see what's going on.

The height of each bar indicates how many facilities were out of compliance due to missing or late reports, while the red line shows the average across all Aprils on record and the yellow line shows the average for the past three Aprils (2017–2019).

In [71]:
by_date.index = by_date.index.strftime('%Y-%m') # makes the x axis (date) prettier

chart_title = "Total CWA Non-Compliance due to Missing or Late Reports"
ax = by_date.plot(kind='bar', title = chart_title, figsize=(20, 10), legend=False, fontsize=16)
ax.set_xlabel("Year", fontsize=16)
ax.set_ylabel("Count of facilities", fontsize=16)

#label trendline
trend=by_date['RNC_DETECTION_CODE'].mean()
ax.axhline(y=trend, color='r', linestyle='-')

#label past 3 Aprils trend (2017, 2018, 2019)
trend_month=pd.concat([by_date.loc["2017-04"],by_date.loc["2018-04"],by_date.loc["2019-04"]])
trend_month=trend_month['RNC_DETECTION_CODE'].mean()
ax.axhline(y=trend_month, color='y', linestyle='-')
Out[71]:
<matplotlib.lines.Line2D at 0x7fe6de4bafd0>

Which facilities didn't report in the month of April?

This will give us a good indicator of the impact of EPA's memo, which went into effect that month.

First, let's get more information about those facilities.

In [72]:
latest = data[(data["SCHEDULE_DATE"] >= '2019-01-01') & (data["SCHEDULE_DATE"] < '2020-05-01')] # the query kept only Aprils, so this window covers April 2019 and April 2020

if (len(latest.index)>0):
    #get facility information from ECHO
    sql = "select FAC_NAME, NPDES_IDS, FAC_LAT, FAC_LONG, CWA_QTRS_WITH_NC" + \
        " from ECHO_EXPORTER where NPDES_FLAG = 'Y' "
    url='http://apps.tlt.stonybrook.edu/echoepa/?query='
    data_location=url+urllib.parse.quote(sql)
    echo_data = pd.read_csv(data_location,encoding='iso-8859-1',header = 0)
    echo_data.set_index( 'NPDES_IDS', inplace=True )
    
    #merge echo and npdes data
    latest = latest.join(echo_data)
    print(latest)
    
else:
    print("Actually, there were no reporting violations for April")    
          SCHEDULE_DATE RNC_DETECTION_CODE  \
AKG315101    2019-04-01                  N   
CA0084284    2019-04-01                  N   
CO0020877    2019-04-01                  N   
COG589010    2019-04-15                  N   
COG589010    2019-04-15                  N   
...                 ...                ...   
WV0002496    2019-04-01                  N   
WV0002496    2019-04-01                  N   
WV0027472    2019-04-01                  N   
WV0112640    2019-04-01                  N   
WV0116661    2019-04-01                  N   

                                               FAC_NAME    FAC_LAT  \
AKG315101                                           NaN        NaN   
CA0084284  HOLLYWOOD CASINO WASTE WATER TREATMENT PLANT  32.703194   
CO0020877                                           NaN        NaN   
COG589010                   PURGATORY METROPOLITAN DIST  37.634497   
COG589010                   PURGATORY METROPOLITAN DIST  37.634497   
...                                                 ...        ...   
WV0002496                            ICL-IP AMERICA INC  38.772220   
WV0002496                            ICL-IP AMERICA INC  38.772220   
WV0027472                                           NaN        NaN   
WV0112640          AFFIRM OILFIELD SERVICES, ST. MARY'S  39.345675   
WV0116661                                           NaN        NaN   

             FAC_LONG  CWA_QTRS_WITH_NC  
AKG315101         NaN               NaN  
CA0084284 -116.870583               6.0  
CO0020877         NaN               NaN  
COG589010 -107.806298              12.0  
COG589010 -107.806298              12.0  
...               ...               ...  
WV0002496  -82.205560              10.0  
WV0002496  -82.205560              10.0  
WV0027472         NaN               NaN  
WV0112640  -81.339039               6.0  
WV0116661         NaN               NaN  

[407 rows x 6 columns]

Map them!

Now we'll map those facilities. The ones that have spent more quarters in non-compliance over the past three years will be displayed as larger.

In [73]:
# Filter to remove rows with missing coordinates - both latitude and longitude are needed for mapping
latest = latest[~(np.isnan(latest["FAC_LAT"])) & ~(np.isnan(latest["FAC_LONG"]))]

# Generate a scale by which we can classify facilities by quarters in non-compliance, and map them accordingly
latest["quartile"]=pd.qcut(latest["CWA_QTRS_WITH_NC"], 4, labels=False) # quartiles
#latest.merge(scale)
#print(latest)
radii={0:6, 1:10, 2:14, 3: 20}

def mapper(df):
    # Initialize the map
    m = folium.Map(
        location = [df.mean()["FAC_LAT"], df.mean()["FAC_LONG"]],
        zoom_start = 8
    )

    # Add a clickable marker for each facility
    for index, row in df.iterrows():
        #print(index,row)
        folium.CircleMarker(
            location = [row["FAC_LAT"], row["FAC_LONG"]],
            popup = row["FAC_NAME"] + "\n Quarters in Non-Compliance over the past three years: "+str(row["CWA_QTRS_WITH_NC"]),
            radius = radii[row["quartile"]],
            color = "black",
            weight = 1,
            fill_color = "orange",
            fill_opacity= .4
        ).add_to(m)

    bounds = m.get_bounds()
    m.fit_bounds(bounds)

    # Show the map
    return m

map_of_facilities = mapper(latest)
map_of_facilities
Out[73]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Select your state

Let's dissect the patterns above by looking at a specific state...
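The commented-out `NPDES_ID like '<state>%'` clauses in the queries above suggest the approach: NPDES permit IDs begin with the two-letter state code, so a local filter can be sketched like this (with made-up sample data):

```python
import pandas as pd

my_state = 'TX'  # two-letter code for the state of interest

# Hypothetical sample; real data would come from the ECHO queries above.
data = pd.DataFrame({
    'NPDES_ID': ['TX0000123', 'CA0084284', 'TX0000456'],
    'EXCEEDENCE_PCT': [40, 60, 25],
}).set_index('NPDES_ID')

# NPDES IDs start with the state code, so a prefix match selects one state.
state_data = data[data.index.str.startswith(my_state)]
```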

In [ ]:
 