Analysis: HBO's Chernobyl Impact on Wikipedia

Last updated: July 5th, 2019

Last May, HBO released "Chernobyl", a historical miniseries (just 5 episodes) based on the events of the Chernobyl nuclear disaster.

The show received widespread attention and was the source of multiple controversies. Suddenly, my Twitter timeline was swarmed with tweets about Chernobyl: the TV show, the nuclear accident, the city, the USSR, etc.

I wanted to know if that "popularity" was global and real, or just the result of my own "echo chamber". A quick Google Trends search showed an obvious increase in Google searches:

[Google Trends chart: searches for "Chernobyl" spiking around the show's release]

But how meaningful was it? Did it spark a real worldwide controversy, as some sources claimed? How much impact did it have?

I then realized that Wikipedia could be a good source to investigate the real impact of the TV show. If it generated as much controversy as claimed, the article about the nuclear disaster (Chernobyl disaster) would probably reflect that, both in pageviews and in edits.

In [1]:
import itertools
import requests

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Important variables

These are some of the variables we'll use throughout the analysis:

In [5]:
PAGE_TITLE = 'Chernobyl_disaster'
In [6]:
MINISERIES_RELEASE_DATE = pd.Timestamp('2019-05-06')
In [7]:
MINISERIES_LAST_EPISODE_DATE = pd.Timestamp('2019-06-03')

Analyzing pageviews

We want to analyze how many pageviews the Chernobyl disaster article received, focusing on the period around the show's premiere (May 6th, 2019).

To get pageviews from Wikipedia, we'll need to use the Analytics API (different from the Revisions API we'll use later for the edits). Its documentation describes how to get pageviews for an article.

To get daily pageviews for a given article, you can use the following endpoint:

https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{article}/daily/{start}/{end}

We'll build the URL dynamically, filling in the article's page title and the start and end dates.

In [8]:
import urllib.parse
In [9]:
BASE_METRICS_URL = "https://wikimedia.org/api/rest_v1/metrics/"
In [10]:
agents = urllib.parse.urljoin(
    BASE_METRICS_URL,
    'pageviews/per-article/en.wikipedia/all-access/all-agents/')
In [11]:
PAGEVIEWS_START = pd.Timestamp('2019-01-01')
PAGEVIEWS_END = pd.Timestamp.now().replace(hour=0, minute=0, second=0, microsecond=0, nanosecond=0) - pd.Timedelta('1d')
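A slightly shorter, equivalent way to get "yesterday at midnight" is Timestamp.normalize(), which truncates the time component in one call:

# Equivalent to the replace(...) chain above: yesterday at midnight
PAGEVIEWS_END = pd.Timestamp.now().normalize() - pd.Timedelta('1d')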
In [70]:
url = urllib.parse.urljoin(
    agents,
    '{article}/daily/{start}/{end}'.format(
        article=urllib.parse.quote(PAGE_TITLE),
        start=PAGEVIEWS_START.strftime('%Y%m%d00'),
        end=PAGEVIEWS_END.strftime('%Y%m%d00')
    ))

The final URL is:

In [71]:
print(url)
https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Chernobyl_disaster/daily/2019010100/2019062400

Now we just need to get the information from the API. To do that we'll use the requests module:

In [72]:
resp = requests.get(url)

Verify the request is successful (status code == 200):

In [73]:
resp.status_code
Out[73]:
200
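As an aside, requests can perform this check for us: raise_for_status() raises an exception for any error status code. We'll rely on it inside the helper function we build later.

# Equivalent check: raises requests.HTTPError if the response is an error (4xx/5xx)
resp.raise_for_status()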
In [74]:
doc = resp.json()

The returned document has an items key, with pageviews information per day (the day is under timestamp):

In [75]:
doc['items'][:2]
Out[75]:
[{'project': 'en.wikipedia',
  'article': 'Chernobyl_disaster',
  'granularity': 'daily',
  'timestamp': '2019010100',
  'access': 'all-access',
  'agent': 'all-agents',
  'views': 8204},
 {'project': 'en.wikipedia',
  'article': 'Chernobyl_disaster',
  'granularity': 'daily',
  'timestamp': '2019010200',
  'access': 'all-access',
  'agent': 'all-agents',
  'views': 8877}]

We can now create a pandas DataFrame with that info:

In [76]:
df = pd.DataFrame.from_records(doc['items'])
In [77]:
df.head()
Out[77]:
access agent article granularity project timestamp views
0 all-access all-agents Chernobyl_disaster daily en.wikipedia 2019010100 8204
1 all-access all-agents Chernobyl_disaster daily en.wikipedia 2019010200 8877
2 all-access all-agents Chernobyl_disaster daily en.wikipedia 2019010300 8733
3 all-access all-agents Chernobyl_disaster daily en.wikipedia 2019010400 10807
4 all-access all-agents Chernobyl_disaster daily en.wikipedia 2019010500 7453

The timestamp column hasn't been parsed as a timestamp; it's stored as object (the pandas dtype used for strings):

In [78]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175 entries, 0 to 174
Data columns (total 7 columns):
access         175 non-null object
agent          175 non-null object
article        175 non-null object
granularity    175 non-null object
project        175 non-null object
timestamp      175 non-null object
views          175 non-null int64
dtypes: int64(1), object(6)
memory usage: 9.6+ KB

So we need to convert it manually. The trailing '00' is the hour component, so we strip the last two characters before parsing:

In [79]:
df['timestamp'] = pd.to_datetime(df['timestamp'].str[:-2])
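For example, for the first row's timestamp:

# '2019010100' -> '20190101' -> Timestamp('2019-01-01 00:00:00')
pd.to_datetime('2019010100'[:-2])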

Now we sort by that timestamp and we make it the index of the DataFrame:

In [80]:
df.sort_values(by='timestamp', ascending=True, inplace=True)
In [81]:
df.set_index('timestamp', inplace=True)
In [82]:
df.index.is_monotonic_increasing
Out[82]:
True

And we're ready to plot! 🚀

In [83]:
fig, ax = plt.subplots(figsize=(14, 7))

df['views'].plot(ax=ax, label='Pageviews per day')

ax.bar([
    (MINISERIES_RELEASE_DATE + pd.Timedelta('%sd' % x)).to_pydatetime()
    for x in range(-5, 6)
], df.loc[MINISERIES_RELEASE_DATE - pd.Timedelta('5d'): MINISERIES_RELEASE_DATE + pd.Timedelta('5d'), 'views'].values, color='#ff000080')

xpos = MINISERIES_RELEASE_DATE.to_pydatetime()
ypos = df.loc[MINISERIES_RELEASE_DATE, 'views']

xpos_text = (MINISERIES_RELEASE_DATE - pd.Timedelta('30d')).to_pydatetime()
ypos_text = df.loc[MINISERIES_RELEASE_DATE + pd.Timedelta('1d'), 'views']

ax.annotate(
    '1st Episode', xy=(xpos, ypos), xytext=(xpos_text, ypos_text),
    arrowprops=dict(facecolor='red', shrink=0.05),
)

fig.autofmt_xdate()
ax.legend(loc='upper left');

As you can see, there was a huge spike in pageviews after the first episode. On the day the first episode was released, the article had 54,519 pageviews:

In [30]:
df.loc[MINISERIES_RELEASE_DATE, 'views']
Out[30]:
54519

The following day, views jumped to 406,780, roughly 7.5 times the release-day figure:

In [31]:
df.loc[MINISERIES_RELEASE_DATE + pd.Timedelta('1 day'), 'views']
Out[31]:
406780
In [35]:
df.loc[MINISERIES_RELEASE_DATE: MINISERIES_RELEASE_DATE + pd.Timedelta('1 day'), 'views'].pct_change()
Out[35]:
timestamp
2019-05-06         NaN
2019-05-07    6.461252
Name: views, dtype: float64

A 646% increase 😱!
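As a quick sanity check of that figure:

# (next-day views - release-day views) / release-day views
(406780 - 54519) / 54519   # ≈ 6.46, i.e. a 646% increase (views multiplied by ~7.5)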

Comparing pageviews of different articles

To simplify the process of pulling pageviews, we'll build a function 💪, something we always recommend our students do. The function receives the page title plus start and end dates, and returns a pd.Series with pageviews per day:

In [85]:
def get_pageviews(title, start, end):
    agents = urllib.parse.urljoin(
        BASE_METRICS_URL,
        'pageviews/per-article/en.wikipedia/all-access/all-agents/')
    url = urllib.parse.urljoin(
        agents,
        '{article}/daily/{start}/{end}'.format(
            article=urllib.parse.quote(title),
            start=start.strftime('%Y%m%d00'),
            end=end.strftime('%Y%m%d00')
        ))
    resp = requests.get(url)
    resp.raise_for_status()
    
    df = pd.DataFrame.from_records(resp.json()['items'])
    df['timestamp'] = pd.to_datetime(df['timestamp'].str[:-2])
    df.sort_values(by='timestamp', ascending=True, inplace=True)
    df.set_index('timestamp', inplace=True)
    views = df['views']
    views.name = title
    return views
In [86]:
page_views_series = [
    get_pageviews(title, PAGEVIEWS_START, PAGEVIEWS_END)
    for title in [PAGE_TITLE, 'Chernobyl_(miniseries)', 'Chernobyl_Nuclear_Power_Plant', 'Pripyat']
]

We can now use pd.concat to build a big DataFrame with the views of each page:

In [87]:
df = pd.concat(page_views_series, axis=1)

And finally, we can plot all the timeseries together:

In [91]:
df.plot(figsize=(14, 7))
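Since the spike in the disaster article may dwarf the other series on a linear axis, a log scale can make the smaller articles easier to compare. A minimal variation of the same plot:

# Same data, logarithmic y-axis
df.plot(figsize=(14, 7), logy=True)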

What about page edits?

The increase in pageviews was expected: a lot of people searching for Chernobyl, checking facts, digging into details, etc. But that doesn't by itself indicate "controversy". Something that I thought could indicate a deeper impact is the number of edits to the article describing the accident and its aftermath (Chernobyl disaster): more edits would suggest that people are gathering new facts and generating discussion.

To get "edits" to articles, we have to use the more traditional MediaWiki API which is a little bit more rough-edged. Even though it works on top of HTTP, it's NOT a RESTful API, and it's far from what we're used to as standard of APIs.

The "endpoint" we need to use to get revisions of a page is API:Revisions. Here's a quick example:

In [3]:
url = "https://en.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "prop": "revisions",
    "titles": PAGE_TITLE,
    "rvprop": 'timestamp|flags|user|comment|size',
    "rvslots": "main",
    "formatversion": "2",
    'rvlimit': 500,   
    "format": 'json',
}
In [4]:
resp = requests.get(url, params=params)
In [5]:
doc = resp.json()
In [6]:
doc.keys()
Out[6]:
dict_keys(['continue', 'query'])

The revisions of the page are nested under the revisions key for each page returned by the query. This is the latest revision of the page:

In [7]:
doc['query']['pages'][0]['revisions'][0]
Out[7]:
{'minor': False,
 'user': 'Dougsim',
 'timestamp': '2019-06-24T18:20:40Z',
 'size': 243583,
 'comment': 'incorporated summmary from overview in the lead. Other other text moved to section - evacuation.'}

Which coincides with the last revision shown on the page:

[Screenshot: the article's revision history, showing the same latest revision]

Note: you can check all the revisions in the article's revision history (keep in mind it might be outdated by the time you read this).

Building a function

As we did before (and as is generally recommended), the best thing to do is build a function that encapsulates the functionality we need. In this case I'll build a function that acts as a "generator", pulling all the revisions from Wikipedia.

IMPORTANT: Keep in mind that there can be hundreds of thousands of revisions for a single page. Each request returns at most 500 records, so pulling the full history could take hundreds of requests to the Wikipedia API. That's why I've added a from_ts parameter to get revisions only from a given date onwards. In my case, I only care about revisions from 2019, so my from_ts is 2019-01-01T00:00:00.

In [48]:
def get_all_revisions(page_title, props='timestamp|flags|user|comment|size', from_ts=None):
    url = "https://en.wikipedia.org/w/api.php"

    params = {
        "action": "query",
        "prop": "revisions",
        "titles": page_title,
        "rvprop": props,
        "rvslots": "main",
        "formatversion": "2",
        "rvlimit": 500,
        "format": "json",
    }
    if from_ts:
        # Only fetch revisions from `from_ts` onwards, oldest first
        params = {**params, 'rvstart': from_ts, 'rvdir': 'newer'}

    while True:
        resp = requests.get(url, params=params)
        print("Request done: ", resp.status_code)
        resp.raise_for_status()
        doc = resp.json()
        page = doc['query']['pages'][0]
        if not page.get('revisions'):
            break
        yield from page['revisions']
        if doc.get('batchcomplete'):
            # The batch is complete: no more revisions to pull
            break
        if 'continue' in doc:
            # Pass the continuation token to fetch the next batch of (up to) 500 revisions
            params['rvcontinue'] = doc['continue']['rvcontinue']
In [49]:
revisions = get_all_revisions('Chernobyl_disaster', from_ts='2019-01-01T00:00:00')

We can now use this generator function to create a DataFrame; from_records will infer the columns automatically:

In [50]:
df = pd.DataFrame.from_records(revisions)
Request done:  200
In [51]:
df.shape
Out[51]:
(373, 5)
In [52]:
df['timestamp'] = pd.to_datetime(df['timestamp'])
In [53]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 373 entries, 0 to 372
Data columns (total 5 columns):
comment      373 non-null object
minor        373 non-null bool
size         373 non-null int64
timestamp    373 non-null datetime64[ns]
user         373 non-null object
dtypes: bool(1), datetime64[ns](1), int64(1), object(2)
memory usage: 12.1+ KB
In [54]:
df.head()
Out[54]:
comment minor size timestamp user
0 fixed links per [[WP:EGG]] and clarity False 234417 2019-01-04 09:31:33 Ita140188
1 {{plainlist}} False 234429 2019-01-08 16:58:52 Hairy Dude
2 [[Pediatrics (journal)]] False 234454 2019-01-10 00:03:38 X1\
3 False 234456 2019-01-11 00:51:52 JackOfDiamondz
4 /* The Exclusion Zone */ Copyedit. False 234470 2019-01-13 18:19:52 Rich Farmbrough
In [55]:
df.sort_values(by='timestamp', ascending=True, inplace=True)

When was the last revision?

In [56]:
df['timestamp'].max()
Out[56]:
Timestamp('2019-06-25 18:17:27')

I now need to group all revisions by day. The simplest way to do it is with pandas' floor function, which truncates each timestamp down to midnight:

In [57]:
df['day'] = df['timestamp'].dt.floor('d')
In [58]:
df.head()
Out[58]:
comment minor size timestamp user day
0 fixed links per [[WP:EGG]] and clarity False 234417 2019-01-04 09:31:33 Ita140188 2019-01-04
1 {{plainlist}} False 234429 2019-01-08 16:58:52 Hairy Dude 2019-01-08
2 [[Pediatrics (journal)]] False 234454 2019-01-10 00:03:38 X1\ 2019-01-10
3 False 234456 2019-01-11 00:51:52 JackOfDiamondz 2019-01-11
4 /* The Exclusion Zone */ Copyedit. False 234470 2019-01-13 18:19:52 Rich Farmbrough 2019-01-13

And now we can see how many revisions we have per day:

In [59]:
results = df['day'].value_counts(sort=False)
In [60]:
results.head()
Out[60]:
2019-06-08    9
2019-05-08    2
2019-01-04    1
2019-06-22    4
2019-05-22    7
Name: day, dtype: int64

I'll sort this by day, in ascending order:

In [61]:
results.sort_index(ascending=True, inplace=True)
In [62]:
results.head()
Out[62]:
2019-01-04    1
2019-01-08    1
2019-01-10    1
2019-01-11    1
2019-01-13    1
Name: day, dtype: int64
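As an alternative, the same daily counts (this time including days with zero edits) can be computed directly from the raw timestamps. A minimal sketch, assuming the same df as above:

# Count revisions per calendar day; days without any edit appear as 0
daily_counts = df.resample('D', on='timestamp').size()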

And now I'll aggregate it by week (weeks start on Monday and are labeled by their first day):

In [63]:
weekly_stats = results.resample('W-Mon', label='left', closed='left').sum()
In [64]:
weekly_stats.head()
Out[64]:
2018-12-31    1
2019-01-07    4
2019-01-14    3
2019-01-21    4
2019-01-28    2
Freq: W-MON, Name: day, dtype: int64

Time to plot!

In [67]:
fig, ax = plt.subplots(figsize=(14, 7))
weekly_stats.plot(kind='bar', color='steelblue', ax=ax, label='Number of edits per week')

ax.set_xticklabels([x.strftime('%Y-%m-%d') for x in weekly_stats.index])
xpos = list(weekly_stats.index).index(MINISERIES_RELEASE_DATE)
ypos = weekly_stats[MINISERIES_RELEASE_DATE]

ax.annotate(
    '1st Episode', xy=(xpos, ypos), xytext=(xpos-3, ypos + 10),
    arrowprops=dict(facecolor='red', shrink=0.05),
    #horizontalalignment='right', verticalalignment='top',
)

xpos = list(weekly_stats.index).index(MINISERIES_LAST_EPISODE_DATE)
ypos = weekly_stats[MINISERIES_LAST_EPISODE_DATE]
ax.annotate(
    'Last Episode', xy=(xpos, ypos), xytext=(xpos + 1, ypos + 6),
    arrowprops=dict(facecolor='red', shrink=0.05),
    #horizontalalignment='right', verticalalignment='top',
)

fig.autofmt_xdate()
ax.legend(loc='upper left');

Clearly, the series had a deep impact on the number of edits to the Wikipedia page.
