Pandas DataFrames

Last updated: May 8th, 2019

rmotr


Pandas - DataFrames

The DataFrame is probably the most important data structure in pandas. It's a tabular structure tightly integrated with Series.

Hands on!

In [1]:
import numpy as np
import pandas as pd

A DataFrame is a tabular structure with the following properties:

  • It's composed of an ordered sequence of rows and an ordered sequence of columns.
  • It also uses an index to reference individual rows.
  • Each column can have a different NumPy-related type.
  • It can be seen as a collection of multiple Series, all sharing the same index.
  • Can be "sliced" horizontally (per row) or vertically (per column).
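
The properties above can be checked on a tiny example. This is only a sketch: the frame and its column names are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical sample data, not taken from the notebook
frame = pd.DataFrame(
    {"count": [1, 2, 3], "label": ["a", "b", "c"]},
    index=["x", "y", "z"],
)

# Selecting one column gives back a Series that shares the DataFrame's index
column = frame["count"]
print(type(column).__name__)             # Series
print(column.index.equals(frame.index))  # True

# Each column keeps its own dtype
print(frame.dtypes)
```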

DataFrames creation

The DataFrame constructor accepts the following parameters:

  • data: (required) holds the data we want to store in the DataFrame. It can be a dictionary of Series, a dictionary of sequences, a two-dimensional ndarray, a Series or another DataFrame.
  • index: (optional) holds the labels we want to assign to the rows of our DataFrame. It can be a Python sequence or a one-dimensional ndarray. Default value: the integers 0, 1, ..., up to the number of rows minus 1.
  • columns: (optional) holds the labels we want to assign to the columns of our DataFrame. It can be a Python sequence or a one-dimensional ndarray. Default value: the dictionary keys when data is a dictionary, otherwise the integers 0, 1, ..., up to the number of columns minus 1.
  • dtype: (optional) a single NumPy data type to apply to all columns.
In [2]:
# Using a dictionary of sequences
dataframe = pd.DataFrame({'var1': [1, 2, 3],
                          'var2': ['one', 'two', 'three'],
                          'var3': [1.0, 2.0, 3.0]})

dataframe
Out[2]:
var1 var2 var3
0 1 one 1.0
1 2 two 2.0
2 3 three 3.0
In [3]:
# Using a dictionary of Series
dataframe = pd.DataFrame({'var1': pd.Series([1, 2, 3], dtype=np.float64),
                          'var2': pd.Series(['a', 'b'])})

dataframe
Out[3]:
var1 var2
0 1.0 a
1 2.0 b
2 3.0 NaN
In [4]:
# Using an ndarray with labels for rows and columns
dataframe = pd.DataFrame(np.arange(16).reshape(4, 4),
                         index=['r1', 'r2', 'r3', 'r4'],
                         columns=['c1', 'c2', 'c3', 'c4'])
dataframe
Out[4]:
c1 c2 c3 c4
r1 0 1 2 3
r2 4 5 6 7
r3 8 9 10 11
r4 12 13 14 15
In [8]:
# Using an ndarray with default row and column labels, with a fixed type
dataframe = pd.DataFrame(np.arange(16).reshape(4,4), dtype=np.int32)

dataframe
Out[8]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
In [9]:
dataframe.dtypes
Out[9]:
0    int32
1    int32
2    int32
3    int32
dtype: object

DataFrame elements

DataFrames expose some useful attributes:

In [6]:
dataframe = pd.DataFrame({'var1': pd.Series([1, 2, 3], dtype=np.float64),
                          'var2': pd.Series(['a', 'b'])})

dataframe
Out[6]:
var1 var2
0 1.0 a
1 2.0 b
2 3.0 NaN
In [7]:
# Type of our DataFrame columns
dataframe.dtypes
Out[7]:
var1    float64
var2     object
dtype: object
In [12]:
dataframe
Out[12]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
In [10]:
# Values of a DataFrame
dataframe.values
Out[10]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]], dtype=int32)
In [11]:
dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
0    4 non-null int32
1    4 non-null int32
2    4 non-null int32
3    4 non-null int32
dtypes: int32(4)
memory usage: 144.0 bytes
In [13]:
type(dataframe.values)
Out[13]:
numpy.ndarray
In [16]:
# Using an ndarray with labels for rows and columns
dataframe = pd.DataFrame(np.arange(16).reshape(4, 4),
                         index=['r1', 'r2', 'r3', 'r4'],
                         columns=['c1', 'c2', 'c3', 'c4'])
dataframe
Out[16]:
c1 c2 c3 c4
r1 0 1 2 3
r2 4 5 6 7
r3 8 9 10 11
r4 12 13 14 15
In [17]:
# Index of a DataFrame
dataframe.index
Out[17]:
Index(['r1', 'r2', 'r3', 'r4'], dtype='object')
In [18]:
# Columns of a DataFrame
dataframe.columns
Out[18]:
Index(['c1', 'c2', 'c3', 'c4'], dtype='object')
In [19]:
# Dimension of the DataFrame
dataframe.ndim
Out[19]:
2
In [20]:
# Shape of the DataFrame
dataframe.shape
Out[20]:
(4, 4)
In [21]:
# Number of DataFrame elements
dataframe.size
Out[21]:
16

Indexes are immutable, so we can't change individual values independently. However, we can replace a complete index with a new one. We'll see that in detail in the Reindexing section.

In [22]:
dataframe = pd.DataFrame({'var1': pd.Series([1, 2, 3], dtype=np.float64),
                          'var2': pd.Series(['a', 'b'])})

dataframe
Out[22]:
var1 var2
0 1.0 a
1 2.0 b
2 3.0 NaN
In [23]:
# Modifying a row index will give us an error
dataframe.index[0] = 4
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-96113c2111c7> in <module>
      1 # Modifying a row index will give us an error
----> 2 dataframe.index[0] = 4

/usr/local/lib/python3.6/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   2048 
   2049     def __setitem__(self, key, value):
-> 2050         raise TypeError("Index does not support mutable operations")
   2051 
   2052     def __getitem__(self, key):

TypeError: Index does not support mutable operations
In [24]:
# Modifying a column index will give us an error
dataframe.columns[0] = 4
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-24-e244ac47fdd7> in <module>
      1 # Modifying a column index will give us an error
----> 2 dataframe.columns[0] = 4

/usr/local/lib/python3.6/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   2048 
   2049     def __setitem__(self, key, value):
-> 2050         raise TypeError("Index does not support mutable operations")
   2051 
   2052     def __getitem__(self, key):

TypeError: Index does not support mutable operations
In [25]:
# This will work
dataframe.index = ['r1', 'r2', 'r3']
dataframe
Out[25]:
var1 var2
r1 1.0 a
r2 2.0 b
r3 3.0 NaN
In [26]:
# This will work
dataframe.columns = ['c1', 'c2']
dataframe
Out[26]:
c1 c2
r1 1.0 a
r2 2.0 b
r3 3.0 NaN

The Group of Seven

We'll continue our analysis of the "G7 countries", now looking at DataFrames. As mentioned, a DataFrame looks a lot like a table (like the one shown here):

image

Creating DataFrames manually can be tedious. Most of the time you'll be pulling the data from a database, a CSV file or the web. Still, you can create a DataFrame by specifying the columns and values:

In [27]:
import pandas as pd
In [28]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

(The columns argument is optional. I'm using it to keep the same column order as in the picture above.)

In [29]:
df
Out[29]:
Population GDP Surface Area HDI Continent
0 35.467 1785387 9984670 0.913 America
1 63.951 2833687 640679 0.888 Europe
2 80.940 3874437 357114 0.916 Europe
3 60.665 2167744 301336 0.873 Europe
4 127.061 4602367 377930 0.891 Asia
5 64.511 2950039 242495 0.907 Europe
6 318.523 17348075 9525067 0.915 America

DataFrames also have indexes. As you can see in the "table" above, pandas has automatically assigned a numeric, auto-incrementing index to each "row" in our DataFrame. In our case, we know that each row represents a country, so we'll just reassign the index:

In [30]:
df.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

df
Out[30]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America
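
Since most real data comes from files rather than literals, here is a minimal sketch of loading a similar table from CSV text. The CSV content is a made-up excerpt that reuses a few values from the table above, and io.StringIO stands in for a file on disk; neither is part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """Country,Population,GDP,Continent
Canada,35.467,1785387,America
France,63.951,2833687,Europe
Japan,127.061,4602367,Asia
"""

# index_col=0 uses the first column (Country) as the row index,
# so no separate index assignment is needed afterwards
g7_sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(g7_sample)
```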
In [31]:
df.dtypes
Out[31]:
Population      float64
GDP               int64
Surface Area      int64
HDI             float64
Continent        object
dtype: object
In [32]:
type(df.values)
Out[32]:
numpy.ndarray
In [33]:
df.ndim
Out[33]:
2
In [34]:
df.shape
Out[34]:
(7, 5)
In [35]:
df.size
Out[35]:
35
In [36]:
df.columns
Out[36]:
Index(['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'], dtype='object')
In [37]:
df.index
Out[37]:
Index(['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom',
       'United States'],
      dtype='object')
In [38]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Canada to United States
Data columns (total 5 columns):
Population      7 non-null float64
GDP             7 non-null int64
Surface Area    7 non-null int64
HDI             7 non-null float64
Continent       7 non-null object
dtypes: float64(2), int64(2), object(1)
memory usage: 336.0+ bytes
In [39]:
df.describe()
Out[39]:
Population GDP Surface Area HDI
count 7.000000 7.000000e+00 7.000000e+00 7.000000
mean 107.302571 5.080248e+06 3.061327e+06 0.900429
std 97.249970 5.494020e+06 4.576187e+06 0.016592
min 35.467000 1.785387e+06 2.424950e+05 0.873000
25% 62.308000 2.500716e+06 3.292250e+05 0.889500
50% 64.511000 2.950039e+06 3.779300e+05 0.907000
75% 104.000500 4.238402e+06 5.082873e+06 0.914000
max 318.523000 1.734808e+07 9.984670e+06 0.916000
In [40]:
# Count how many columns have each dtype
df.dtypes.value_counts()
Out[40]:
float64    2
int64      2
object     1
dtype: int64
In [41]:
# astype(int) drops the decimals; it does not round
df['Population'].astype(int)
Out[41]:
Canada             35
France             63
Germany            80
Italy              60
Japan             127
United Kingdom     64
United States     318
Name: Population, dtype: int64
In [42]:
df
Out[42]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America
In [43]:
df['Rounded Population'] = df['Population'].astype(int)
In [44]:
df
Out[44]:
Population GDP Surface Area HDI Continent Rounded Population
Canada 35.467 1785387 9984670 0.913 America 35
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
Italy 60.665 2167744 301336 0.873 Europe 60
Japan 127.061 4602367 377930 0.891 Asia 127
United Kingdom 64.511 2950039 242495 0.907 Europe 64
United States 318.523 17348075 9525067 0.915 America 318
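
Note that the values in the new column are truncated rather than rounded (France's 63.951 becomes 63). If rounding to the nearest integer is what you want, one possible approach is to call round() before converting. The sketch below uses a small stand-alone Series with three of the population values.

```python
import pandas as pd

population = pd.Series([35.467, 63.951, 80.94],
                       index=["Canada", "France", "Germany"])

truncated = population.astype(int)        # drops the fractional part: 35, 63, 80
rounded = population.round().astype(int)  # rounds to nearest first: 35, 64, 81

print(truncated.tolist())
print(rounded.tolist())
```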

Indexing, Selection and Slicing

Individual columns in the DataFrame can be selected with regular indexing. Each column is represented as a Series:

In [45]:
df['Population']
Out[45]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Population, dtype: float64
In [46]:
df.Population
Out[46]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Population, dtype: float64

Note that the returned Series has the same index as the DataFrame, and its name is the name of the column. If you're working in a notebook and want a more DataFrame-like display, you can use the to_frame method:

In [47]:
df['Population'].to_frame()
Out[47]:
Population
Canada 35.467
France 63.951
Germany 80.940
Italy 60.665
Japan 127.061
United Kingdom 64.511
United States 318.523

Multiple columns can also be specified:

In [48]:
df[['Population', 'GDP']]
Out[48]:
Population GDP
Canada 35.467 1785387
France 63.951 2833687
Germany 80.940 3874437
Italy 60.665 2167744
Japan 127.061 4602367
United Kingdom 64.511 2950039
United States 318.523 17348075

In this case, the result is another DataFrame.

Indexing by position

Slicing works differently: it acts at "row level", which can be counterintuitive:

In [49]:
df[1:3]
Out[49]:
Population GDP Surface Area HDI Continent Rounded Population
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
In [50]:
df[2:3]
Out[50]:
Population GDP Surface Area HDI Continent Rounded Population
Germany 80.94 3874437 357114 0.916 Europe 80
In [ ]:
df[:]

Indexing with the loc method, using index labels

Command                     Behaviour
obj.loc[key]                Select a single row by index label
obj.loc[key1:key2]          Select a range of rows by index label (both ends included)
obj.loc[[key1,...,keyn]]    Select several rows by a list of index labels
obj.loc[condition]          Select the rows where a boolean condition is True
obj.loc[sel1, sel2]         Select by row label (sel1) and column label (sel2). Selectors: label, slice, sequence, or condition

Row-level selection works better with loc and iloc, which are recommended over regular "direct slicing" (df[:]).

loc selects rows matching the given index labels:

In [54]:
df
Out[54]:
Population GDP Surface Area HDI Continent Rounded Population
Canada 35.467 1785387 9984670 0.913 America 35
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
Italy 60.665 2167744 301336 0.873 Europe 60
Japan 127.061 4602367 377930 0.891 Asia 127
United Kingdom 64.511 2950039 242495 0.907 Europe 64
United States 318.523 17348075 9525067 0.915 America 318
In [55]:
df['Population']
Out[55]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Population, dtype: float64
In [51]:
df.loc['Italy']
Out[51]:
Population             60.665
GDP                   2167744
Surface Area           301336
HDI                     0.873
Continent              Europe
Rounded Population         60
Name: Italy, dtype: object
In [52]:
df.loc['France': 'Italy']
Out[52]:
Population GDP Surface Area HDI Continent Rounded Population
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
Italy 60.665 2167744 301336 0.873 Europe 60

As a second "argument", you can pass the column(s) you'd like to select:

In [53]:
df.loc['France': 'Italy', 'Population']
Out[53]:
France     63.951
Germany    80.940
Italy      60.665
Name: Population, dtype: float64
In [ ]:
df.loc['France': 'Italy', ['Population', 'GDP']]
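
The cells above cover single labels and label slices. The list and condition selectors from the loc table can be sketched as follows, using a small stand-alone frame (a three-country excerpt built here for illustration) so the snippet runs on its own.

```python
import pandas as pd

sample = pd.DataFrame(
    {"Population": [35.467, 63.951, 127.061],
     "Continent": ["America", "Europe", "Asia"]},
    index=["Canada", "France", "Japan"],
)

# A list of labels selects exactly those rows, in the order given
picked = sample.loc[["Japan", "Canada"]]

# A boolean Series keeps the rows where the condition is True;
# the second selector restricts the result to one column
large = sample.loc[sample["Population"] > 60, "Population"]

print(picked)
print(large)
```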

Indexing with the iloc method, using positions

Command                              Behaviour
obj.iloc[num_val]                    Select a single row by position
obj.iloc[num_val1:num_val2]          Select a range of rows by position (end excluded)
obj.iloc[[num_val1,...,num_valn]]    Select several rows by a list of positions
obj.iloc[sel1, sel2]                 Select by row position (sel1) and column position (sel2). Selectors: position, slice or sequence

iloc works with the (numeric) position of each row and column, regardless of its label:

In [56]:
df
Out[56]:
Population GDP Surface Area HDI Continent Rounded Population
Canada 35.467 1785387 9984670 0.913 America 35
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
Italy 60.665 2167744 301336 0.873 Europe 60
Japan 127.061 4602367 377930 0.891 Asia 127
United Kingdom 64.511 2950039 242495 0.907 Europe 64
United States 318.523 17348075 9525067 0.915 America 318
In [57]:
df.iloc[0]
Out[57]:
Population             35.467
GDP                   1785387
Surface Area          9984670
HDI                     0.913
Continent             America
Rounded Population         35
Name: Canada, dtype: object
In [58]:
df.iloc[-1]
Out[58]:
Population             318.523
GDP                   17348075
Surface Area           9525067
HDI                      0.915
Continent              America
Rounded Population         318
Name: United States, dtype: object
In [59]:
df.iloc[[0, 1, -1]]
Out[59]:
Population GDP Surface Area HDI Continent Rounded Population
Canada 35.467 1785387 9984670 0.913 America 35
France 63.951 2833687 640679 0.888 Europe 63
United States 318.523 17348075 9525067 0.915 America 318
In [60]:
df.iloc[1:3]
Out[60]:
Population GDP Surface Area HDI Continent Rounded Population
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
In [61]:
df.iloc[1:3, 3]
Out[61]:
France     0.888
Germany    0.916
Name: HDI, dtype: float64
In [62]:
df.iloc[1:3, [0, 3]]
Out[62]:
Population HDI
France 63.951 0.888
Germany 80.940 0.916
In [ ]:
df.iloc[1:3, 1:3]

RECOMMENDED: Always use loc and iloc to reduce ambiguity, especially with DataFrames that have numeric indexes.
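
To see why numeric indexes are the risky case, consider this small sketch. The frame and its integer labels are hypothetical, chosen so that labels and positions disagree.

```python
import pandas as pd

# Hypothetical frame whose integer labels do not start at 0
numbered = pd.DataFrame({"value": ["a", "b", "c"]}, index=[10, 20, 30])

by_label = numbered.loc[10, "value"]   # row labelled 10 -> "a"
by_position = numbered.iloc[1, 0]      # second row, first column -> "b"

# Slices differ too: loc includes the end label, iloc excludes the end position
label_slice = numbered.loc[10:20]      # rows 10 and 20
position_slice = numbered.iloc[0:1]    # only the first row

print(by_label, by_position, len(label_slice), len(position_slice))
```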

Checking existence of a key:

In [63]:
df
Out[63]:
Population GDP Surface Area HDI Continent Rounded Population
Canada 35.467 1785387 9984670 0.913 America 35
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
Italy 60.665 2167744 301336 0.873 Europe 60
Japan 127.061 4602367 377930 0.891 Asia 127
United Kingdom 64.511 2950039 242495 0.907 Europe 64
United States 318.523 17348075 9525067 0.915 America 318
In [64]:
'Population' in df
Out[64]:
True
In [65]:
'Currency' in df
Out[65]:
False
In [66]:
'Canada' in df.index
Out[66]:
True
In [67]:
'Brazil' in df.index
Out[67]:
False

Counting Things

There are a couple of very simple methods to count values in pandas:

In [68]:
df
Out[68]:
Population GDP Surface Area HDI Continent Rounded Population
Canada 35.467 1785387 9984670 0.913 America 35
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
Italy 60.665 2167744 301336 0.873 Europe 60
Japan 127.061 4602367 377930 0.891 Asia 127
United Kingdom 64.511 2950039 242495 0.907 Europe 64
United States 318.523 17348075 9525067 0.915 America 318
In [70]:
df['Continent'] == 'Asia'
Out[70]:
Canada            False
France            False
Germany           False
Italy             False
Japan              True
United Kingdom    False
United States     False
Name: Continent, dtype: bool
In [69]:
df[df['Continent'] == 'Asia']
Out[69]:
Population GDP Surface Area HDI Continent Rounded Population
Japan 127.061 4602367 377930 0.891 Asia 127
In [71]:
df[df['Continent'] == 'Asia'].shape
Out[71]:
(1, 6)

The count method only counts non-null values (we'll talk about null values in our next module):

In [72]:
df[df['Continent'] == 'Asia'].count()
Out[72]:
Population            1
GDP                   1
Surface Area          1
HDI                   1
Continent             1
Rounded Population    1
dtype: int64
In [ ]:
len(df[df['Continent'] == 'Asia'])
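
The difference between counting rows and counting non-null values only shows up when data is missing. In the sketch below, Germany's HDI is set to NaN purely for illustration (it is not missing in the real table), and summing the boolean mask is shown as one more way to count matches.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {"Continent": ["America", "Europe", "Europe", "Asia"],
     "HDI": [0.913, 0.888, np.nan, 0.891]},  # NaN inserted for illustration only
    index=["Canada", "France", "Germany", "Japan"],
)

europe = sample[sample["Continent"] == "Europe"]

n_rows = len(europe)                               # counts rows: 2
n_hdi = europe["HDI"].count()                      # counts non-null values only: 1
n_mask = (sample["Continent"] == "Europe").sum()   # each True counts as 1: 2

print(n_rows, n_hdi, n_mask)
```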

When we have categorical data, we can display the unique occurrences with the method unique():

In [75]:
df
Out[75]:
Population GDP Surface Area HDI Continent Rounded Population
Canada 35.467 1785387 9984670 0.913 America 35
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
Italy 60.665 2167744 301336 0.873 Europe 60
Japan 127.061 4602367 377930 0.891 Asia 127
United Kingdom 64.511 2950039 242495 0.907 Europe 64
United States 318.523 17348075 9525067 0.915 America 318
In [73]:
df['Continent'].unique()
Out[73]:
array(['America', 'Europe', 'Asia'], dtype=object)

And produce a count with value_counts():

In [74]:
df['Continent'].value_counts()
Out[74]:
Europe     4
America    2
Asia       1
Name: Continent, dtype: int64

Reindexing

Indexes are immutable, so we can't change individual values independently. However, we can change a complete index with a new index.

In [ ]:
df

Changing row indexes

In [76]:
df.index
Out[76]:
Index(['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom',
       'United States'],
      dtype='object')
In [77]:
# Reorder current DataFrame indexes
df.reindex(['France',
            'Germany',
            'Italy',
            'Canada',
            'Japan',
            'United Kingdom',
            'United States'])
Out[77]:
Population GDP Surface Area HDI Continent Rounded Population
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
Italy 60.665 2167744 301336 0.873 Europe 60
Canada 35.467 1785387 9984670 0.913 America 35
Japan 127.061 4602367 377930 0.891 Asia 127
United Kingdom 64.511 2950039 242495 0.907 Europe 64
United States 318.523 17348075 9525067 0.915 America 318
In [ ]:
# Adding a new index value to a DataFrame
df.reindex(['France',
            'Germany',
            'Italy',
            'Canada',
            'Japan',
            'United Kingdom',
            'United States',
            'Brazil'])
In [ ]:
# Adding a new index value to a DataFrame, with default fill value
df.reindex(['France',
            'Germany',
            'Italy',
            'Canada',
            'Japan',
            'United Kingdom',
            'United States',
            'Brazil'], fill_value=0)
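
The two cells above have no recorded output, so here is a small self-contained sketch of what reindex does with a label that isn't in the original index. The two-row frame is a reduced stand-in for df.

```python
import pandas as pd

sample = pd.DataFrame({"GDP": [1785387, 2833687]}, index=["Canada", "France"])

# A label that is not in the original index produces a row of NaN
with_nan = sample.reindex(["France", "Canada", "Brazil"])

# fill_value replaces the missing entries instead
with_zero = sample.reindex(["France", "Canada", "Brazil"], fill_value=0)

print(with_nan)
print(with_zero)
print(list(sample.index))  # ['Canada', 'France'], reindex returned new objects
```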

Also, indexes can be sorted:

In [78]:
df.sort_index()
Out[78]:
Population GDP Surface Area HDI Continent Rounded Population
Canada 35.467 1785387 9984670 0.913 America 35
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
Italy 60.665 2167744 301336 0.873 Europe 60
Japan 127.061 4602367 377930 0.891 Asia 127
United Kingdom 64.511 2950039 242495 0.907 Europe 64
United States 318.523 17348075 9525067 0.915 America 318
In [79]:
df.sort_index(ascending=False)
Out[79]:
Population GDP Surface Area HDI Continent Rounded Population
United States 318.523 17348075 9525067 0.915 America 318
United Kingdom 64.511 2950039 242495 0.907 Europe 64
Japan 127.061 4602367 377930 0.891 Asia 127
Italy 60.665 2167744 301336 0.873 Europe 60
Germany 80.940 3874437 357114 0.916 Europe 80
France 63.951 2833687 640679 0.888 Europe 63
Canada 35.467 1785387 9984670 0.913 America 35
In [80]:
df
Out[80]:
Population GDP Surface Area HDI Continent Rounded Population
Canada 35.467 1785387 9984670 0.913 America 35
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
Italy 60.665 2167744 301336 0.873 Europe 60
Japan 127.061 4602367 377930 0.891 Asia 127
United Kingdom 64.511 2950039 242495 0.907 Europe 64
United States 318.523 17348075 9525067 0.915 America 318

A DataFrame's index can be discarded at any time by moving it into a new column of our data. The new index will be a numerical sequence.

In [82]:
df
Out[82]:
Population GDP Surface Area HDI Continent Rounded Population
Canada 35.467 1785387 9984670 0.913 America 35
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
Italy 60.665 2167744 301336 0.873 Europe 60
Japan 127.061 4602367 377930 0.891 Asia 127
United Kingdom 64.511 2950039 242495 0.907 Europe 64
United States 318.523 17348075 9525067 0.915 America 318
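
The cell that performs this step is not shown above. In pandas, this operation is normally done with reset_index; the sketch below assumes that is the method the text refers to, and uses a reduced stand-in for df.

```python
import pandas as pd

sample = pd.DataFrame({"GDP": [1785387, 2833687]}, index=["Canada", "France"])

# By default the old index becomes a column named "index"
# (or the index's own name, if it has one)
moved = sample.reset_index()

# drop=True discards the old index instead of keeping it as a column
dropped = sample.reset_index(drop=True)

print(moved)
print(dropped)
```

Like reindex, reset_index returns a new DataFrame and leaves the original untouched.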

Also, we can set one or more columns as the DataFrame index:

In [84]:
df.set_index(['GDP'])
Out[84]:
Population Surface Area HDI Continent Rounded Population
GDP
1785387 35.467 9984670 0.913 America 35
2833687 63.951 640679 0.888 Europe 63
3874437 80.940 357114 0.916 Europe 80
2167744 60.665 301336 0.873 Europe 60
4602367 127.061 377930 0.891 Asia 127
2950039 64.511 242495 0.907 Europe 64
17348075 318.523 9525067 0.915 America 318
In [85]:
df
Out[85]:
Population GDP Surface Area HDI Continent Rounded Population
Canada 35.467 1785387 9984670 0.913 America 35
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
Italy 60.665 2167744 301336 0.873 Europe 60
Japan 127.061 4602367 377930 0.891 Asia 127
United Kingdom 64.511 2950039 242495 0.907 Europe 64
United States 318.523 17348075 9525067 0.915 America 318

Changing column indexes

In [87]:
df.columns = ['P', 'GDP', 'SA', 'HDI', 'C', 'RP']
df
Out[87]:
P GDP SA HDI C RP
Canada 35.467 1785387 9984670 0.913 America 35
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
Italy 60.665 2167744 301336 0.873 Europe 60
Japan 127.061 4602367 377930 0.891 Asia 127
United Kingdom 64.511 2950039 242495 0.907 Europe 64
United States 318.523 17348075 9525067 0.915 America 318
In [88]:
# go back
df.columns = ['Population', 'GDP', 'Surface Area', 'HDI', 'Continent', 'Rounded Population']

Changing row and column names at once

In [89]:
df.rename(
    columns={
        'HDI': 'Human Development Index',
        'Anual Popcorn Consumption': 'APC'
    },
    index={
        'United States': 'USA',
        'United Kingdom': 'UK',
        'Argentina': 'AR'
    }
)
Out[89]:
Population GDP Surface Area Human Development Index Continent Rounded Population
Canada 35.467 1785387 9984670 0.913 America 35
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
Italy 60.665 2167744 301336 0.873 Europe 60
Japan 127.061 4602367 377930 0.891 Asia 127
UK 64.511 2950039 242495 0.907 Europe 64
USA 318.523 17348075 9525067 0.915 America 318
In [90]:
df
Out[90]:
Population GDP Surface Area HDI Continent Rounded Population
Canada 35.467 1785387 9984670 0.913 America 35
France 63.951 2833687 640679 0.888 Europe 63
Germany 80.940 3874437 357114 0.916 Europe 80
Italy 60.665 2167744 301336 0.873 Europe 60
Japan 127.061 4602367 377930 0.891 Asia 127
United Kingdom 64.511 2950039 242495 0.907 Europe 64
United States 318.523 17348075 9525067 0.915 America 318
In [92]:
df.rename(index=str.upper)
Out[92]:
Population GDP Surface Area HDI Continent Rounded Population
CANADA 35.467 1785387 9984670 0.913 America 35
FRANCE 63.951 2833687 640679 0.888 Europe 63
GERMANY 80.940 3874437 357114 0.916 Europe 80
ITALY 60.665 2167744 301336 0.873 Europe 60
JAPAN 127.061 4602367 377930 0.891 Asia 127
UNITED KINGDOM 64.511 2950039 242495 0.907 Europe 64
UNITED STATES 318.523 17348075 9525067 0.915 America 318
In [93]:
df.rename(index=lambda x: x.lower())
Out[93]:
Population GDP Surface Area HDI Continent Rounded Population
canada 35.467 1785387 9984670 0.913 America 35
france 63.951 2833687 640679 0.888 Europe 63
germany 80.940 3874437 357114 0.916 Europe 80
italy 60.665 2167744 301336 0.873 Europe 60
japan 127.061 4602367 377930 0.891 Asia 127
united kingdom 64.511 2950039 242495 0.907 Europe 64
united states 318.523 17348075 9525067 0.915 America 318
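
Two details of the rename calls above are easy to miss: keys that match nothing (such as 'Argentina') are skipped silently, and the original DataFrame is left unchanged. A minimal sketch on a stand-in frame:

```python
import pandas as pd

sample = pd.DataFrame({"HDI": [0.913, 0.888]}, index=["Canada", "France"])

# "Argentina" matches no row label, so that entry is skipped without an error
renamed = sample.rename(
    columns={"HDI": "Human Development Index"},
    index={"France": "FR", "Argentina": "AR"},
)

print(list(renamed.index))    # ['Canada', 'FR']
print(list(renamed.columns))  # ['Human Development Index']
print(list(sample.columns))   # ['HDI'], the original is unchanged
```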

Modifying DataFrames

Modifying a DataFrame is simple and intuitive. You can add columns or replace the values of a column without issues:

Replacing values per column

In [94]:
df['Language'] = 'English'
In [95]:
df
Out[95]:
Population GDP Surface Area HDI Continent Rounded Population Language
Canada 35.467 1785387 9984670 0.913 America 35 English
France 63.951 2833687 640679 0.888 Europe 63 English
Germany 80.940 3874437 357114 0.916 Europe 80 English
Italy 60.665 2167744 301336 0.873 Europe 60 English
Japan 127.061 4602367 377930 0.891 Asia 127 English
United Kingdom 64.511 2950039 242495 0.907 Europe 64 English
United States 318.523 17348075 9525067 0.915 America 318 English
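
Assigning a scalar, as above, gives every row the same value. As a sketch of two related assignments (the Capital column and the stand-in frame are illustrative additions, not part of the notebook), a Series is matched by index label, and loc can replace a single cell:

```python
import pandas as pd

sample = pd.DataFrame({"GDP": [1785387, 2833687, 4602367]},
                      index=["Canada", "France", "Japan"])

# A scalar is broadcast to every row
sample["Language"] = "English"

# A Series is matched by index label; rows it does not mention get NaN
sample["Capital"] = pd.Series({"Japan": "Tokyo", "France": "Paris"})

# loc with a row label and a column name replaces a single cell
sample.loc["France", "Language"] = "French"

print(sample)
```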

Transpose DataFrames

We can transpose a DataFrame, swapping rows and columns. The column labels become the row index, and vice versa.

In [96]:
df
Out[96]:
Population GDP Surface Area HDI Continent Rounded Population Language
Canada 35.467 1785387 9984670 0.913 America 35 English
France 63.951 2833687 640679 0.888 Europe 63 English
Germany 80.940 3874437 357114 0.916 Europe 80 English
Italy 60.665 2167744 301336 0.873 Europe 60 English
Japan 127.061 4602367 377930 0.891 Asia 127 English
United Kingdom 64.511 2950039 242495 0.907 Europe 64 English
United States 318.523 17348075 9525067 0.915 America 318 English
In [97]:
df.T
Out[97]:
Canada France Germany Italy Japan United Kingdom United States
Population 35.467 63.951 80.94 60.665 127.061 64.511 318.523
GDP 1785387 2833687 3874437 2167744 4602367 2950039 17348075
Surface Area 9984670 640679 357114 301336 377930 242495 9525067
HDI 0.913 0.888 0.916 0.873 0.891 0.907 0.915
Continent America Europe Europe Europe Asia Europe America
Rounded Population 35 63 80 60 127 64 318
Language English English English English English English English
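
One side effect worth knowing: after transposing a DataFrame with mixed column types, each new column holds both numbers and text, so pandas stores it with the generic object dtype. A small sketch on a stand-in frame:

```python
import pandas as pd

sample = pd.DataFrame(
    {"GDP": [1785387, 2833687], "Continent": ["America", "Europe"]},
    index=["Canada", "France"],
)

flipped = sample.T

print(list(flipped.index))    # ['GDP', 'Continent']
print(list(flipped.columns))  # ['Canada', 'France']
print(flipped.dtypes)         # every column is object, since each mixes numbers and text
```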

Adding new elements to a DataFrame

In [98]:
df
Out[98]:
Population GDP Surface Area HDI Continent Rounded Population Language
Canada 35.467 1785387 9984670 0.913 America 35 English
France 63.951 2833687 640679 0.888 Europe 63 English
Germany 80.940 3874437 357114 0.916 Europe 80 English
Italy 60.665 2167744 301336 0.873 Europe 60 English
Japan 127.061 4602367 377930 0.891 Asia 127 English
United Kingdom 64.511 2950039 242495 0.907 Europe 64 English
United States 318.523 17348075 9525067 0.915 America 318 English

Adding a new row

Empty values will be filled with NaN values.

In [99]:
# DataFrame.append was removed in pandas 2.0; this cell needs an older version
df = df.append(pd.Series({
    'Population': 50233,
    'GDP': 1485387,
    'Surface Area': 8923670
}, name='Brazil'))

df
Out[99]:
Population GDP Surface Area HDI Continent Rounded Population Language
Canada 35.467 1785387.0 9984670.0 0.913 America 35.0 English
France 63.951 2833687.0 640679.0 0.888 Europe 63.0 English
Germany 80.940 3874437.0 357114.0 0.916 Europe 80.0 English
Italy 60.665 2167744.0 301336.0 0.873 Europe 60.0 English
Japan 127.061 4602367.0 377930.0 0.891 Asia 127.0 English
United Kingdom 64.511 2950039.0 242495.0 0.907 Europe 64.0 English
United States 318.523 17348075.0 9525067.0 0.915 America 318.0 English
Brazil 50233.000 1485387.0 8923670.0 NaN NaN NaN NaN
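
On newer pandas releases, where append is no longer available, pd.concat can be used instead. The following is a sketch of one possible equivalent on a reduced stand-in frame, not the notebook's original method; note that the resulting dtypes can differ slightly from the append output shown above.

```python
import pandas as pd

sample = pd.DataFrame({"Population": [35.467, 63.951], "HDI": [0.913, 0.888]},
                      index=["Canada", "France"])

# One-row DataFrame for the new entry; HDI is deliberately left out
brazil = pd.DataFrame({"Population": [50233.0]}, index=["Brazil"])

# concat stacks the rows and fills columns missing from either side with NaN
extended = pd.concat([sample, brazil], sort=False)

print(extended)
```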

You can directly set the new index and values to the DataFrame:

In [100]:
df.loc['China'] = pd.Series({'Population': 1_400_000_000, 'Continent': 'Asia'})

df
Out[100]:
Population GDP Surface Area HDI Continent Rounded Population Language
Canada 3.546700e+01 1785387.0 9984670.0 0.913 America 35.0 English
France 6.395100e+01 2833687.0 640679.0 0.888 Europe 63.0 English
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe 80.0 English
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe 60.0 English
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia 127.0 English
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe 64.0 English
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 318.0 English
Brazil 5.023300e+04 1485387.0 8923670.0 NaN NaN NaN NaN
China 1.400000e+09 NaN NaN NaN Asia NaN NaN

Adding a new column

In [101]:
df
Out[101]:
Population GDP Surface Area HDI Continent Rounded Population Language
Canada 3.546700e+01 1785387.0 9984670.0 0.913 America 35.0 English
France 6.395100e+01 2833687.0 640679.0 0.888 Europe 63.0 English
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe 80.0 English
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe 60.0 English
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia 127.0 English
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe 64.0 English
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 318.0 English
Brazil 5.023300e+04 1485387.0 8923670.0 NaN NaN NaN NaN
China 1.400000e+09 NaN NaN NaN Asia NaN NaN
In [102]:
df['Currency'] = ['Canadian Dolar', 'Euro', 'Euro', 'Euro', 'Yen', 'Pound sterling', 'American Dolar', 'Real', 'Yuan']

df
Out[102]:
Population GDP Surface Area HDI Continent Rounded Population Language Currency
Canada 3.546700e+01 1785387.0 9984670.0 0.913 America 35.0 English Canadian Dolar
France 6.395100e+01 2833687.0 640679.0 0.888 Europe 63.0 English Euro
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe 80.0 English Euro
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe 60.0 English Euro
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia 127.0 English Yen
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe 64.0 English Pound sterling
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 318.0 English American Dolar
Brazil 5.023300e+04 1485387.0 8923670.0 NaN NaN NaN NaN Real
China 1.400000e+09 NaN NaN NaN Asia NaN NaN Yuan

Removing elements from a DataFrame

The opposite of selection is "dropping". Instead of pointing out which values you'd like to select, you point out which ones you'd like to drop:

In [103]:
df
Out[103]:
Population GDP Surface Area HDI Continent Rounded Population Language Currency
Canada 3.546700e+01 1785387.0 9984670.0 0.913 America 35.0 English Canadian Dolar
France 6.395100e+01 2833687.0 640679.0 0.888 Europe 63.0 English Euro
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe 80.0 English Euro
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe 60.0 English Euro
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia 127.0 English Yen
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe 64.0 English Pound sterling
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 318.0 English American Dolar
Brazil 5.023300e+04 1485387.0 8923670.0 NaN NaN NaN NaN Real
China 1.400000e+09 NaN NaN NaN Asia NaN NaN Yuan

 Removing rows

In [104]:
df.drop('Brazil', inplace=True)

df
Out[104]:
Population GDP Surface Area HDI Continent Rounded Population Language Currency
Canada 3.546700e+01 1785387.0 9984670.0 0.913 America 35.0 English Canadian Dolar
France 6.395100e+01 2833687.0 640679.0 0.888 Europe 63.0 English Euro
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe 80.0 English Euro
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe 60.0 English Euro
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia 127.0 English Yen
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe 64.0 English Pound sterling
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 318.0 English American Dolar
China 1.400000e+09 NaN NaN NaN Asia NaN NaN Yuan
In [105]:
# will return a new dataframe
df.drop(['Canada', 'Japan'])
Out[105]:
Population GDP Surface Area HDI Continent Rounded Population Language Currency
France 6.395100e+01 2833687.0 640679.0 0.888 Europe 63.0 English Euro
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe 80.0 English Euro
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe 60.0 English Euro
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe 64.0 English Pound sterling
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 318.0 English American Dolar
China 1.400000e+09 NaN NaN NaN Asia NaN NaN Yuan
In [106]:
df
Out[106]:
Population GDP Surface Area HDI Continent Rounded Population Language Currency
Canada 3.546700e+01 1785387.0 9984670.0 0.913 America 35.0 English Canadian Dolar
France 6.395100e+01 2833687.0 640679.0 0.888 Europe 63.0 English Euro
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe 80.0 English Euro
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe 60.0 English Euro
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia 127.0 English Yen
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe 64.0 English Pound sterling
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 318.0 English American Dolar
China 1.400000e+09 NaN NaN NaN Asia NaN NaN Yuan
In [107]:
# will return a new dataframe
df.drop(['Italy', 'Canada'], axis=0)
Out[107]:
Population GDP Surface Area HDI Continent Rounded Population Language Currency
France 6.395100e+01 2833687.0 640679.0 0.888 Europe 63.0 English Euro
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe 80.0 English Euro
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia 127.0 English Yen
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe 64.0 English Pound sterling
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 318.0 English American Dolar
China 1.400000e+09 NaN NaN NaN Asia NaN NaN Yuan

 Removing columns

In [108]:
# inplace=True modifies df directly (and returns None)
df.drop(columns=['Language'], inplace=True)
df
Out[108]:
Population GDP Surface Area HDI Continent Rounded Population Currency
Canada 3.546700e+01 1785387.0 9984670.0 0.913 America 35.0 Canadian Dolar
France 6.395100e+01 2833687.0 640679.0 0.888 Europe 63.0 Euro
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe 80.0 Euro
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe 60.0 Euro
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia 127.0 Yen
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe 64.0 Pound sterling
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 318.0 American Dolar
China 1.400000e+09 NaN NaN NaN Asia NaN Yuan
In [109]:
# Equivalent alternative: del df['Currency']
df.drop('Currency', axis=1, inplace=True)

df
Out[109]:
Population GDP Surface Area HDI Continent Rounded Population
Canada 3.546700e+01 1785387.0 9984670.0 0.913 America 35.0
France 6.395100e+01 2833687.0 640679.0 0.888 Europe 63.0
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe 80.0
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe 60.0
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia 127.0
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe 64.0
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 318.0
China 1.400000e+09 NaN NaN NaN Asia NaN
In [110]:
# will return a new dataframe
df.drop(['Population', 'HDI'], axis=1)
Out[110]:
GDP Surface Area Continent Rounded Population
Canada 1785387.0 9984670.0 America 35.0
France 2833687.0 640679.0 Europe 63.0
Germany 3874437.0 357114.0 Europe 80.0
Italy 2167744.0 301336.0 Europe 60.0
Japan 4602367.0 377930.0 Asia 127.0
United Kingdom 2950039.0 242495.0 Europe 64.0
United States 17348075.0 9525067.0 America 318.0
China NaN NaN Asia NaN
In [ ]:
# will return a new dataframe
df.drop(['Population', 'HDI'], axis='columns')
In [ ]:
df.drop(['Canada', 'Germany'], axis='rows')

By default, the drop method returns a new DataFrame and leaves the original untouched. If you'd like to modify the DataFrame "in place", pass the inplace=True parameter, as in the 'Brazil' and 'Language' examples above.
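
The difference between the two behaviors of drop can be checked on a small frame. This is a minimal sketch; `df_demo` and its values are hypothetical and exist only for this illustration:

```python
import pandas as pd

# Hypothetical frame, used only for illustration.
df_demo = pd.DataFrame(
    {'GDP': [1785387, 2833687, 4602367], 'HDI': [0.913, 0.888, 0.891]},
    index=['Canada', 'France', 'Japan'],
)

# Default behavior: a new DataFrame comes back, df_demo keeps all its rows.
without_japan = df_demo.drop('Japan')
without_hdi = df_demo.drop(columns='HDI')
rows_before = len(df_demo)

# errors='ignore' skips labels that do not exist instead of raising KeyError.
unchanged = df_demo.drop('Atlantis', errors='ignore')

# inplace=True modifies df_demo itself and returns None.
result = df_demo.drop('France', inplace=True)

print(without_japan)
print(df_demo)
```

Because the in-place call returns None, writing `df_demo = df_demo.drop(..., inplace=True)` would replace the DataFrame with None; use either reassignment or inplace=True, not both.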

green-divider

Conditional selection (boolean arrays)

We saw conditional selection applied to Series and it'll work in the same way for DataFrames. After all, a DataFrame is a collection of Series:

In [111]:
df
Out[111]:
Population GDP Surface Area HDI Continent Rounded Population
Canada 3.546700e+01 1785387.0 9984670.0 0.913 America 35.0
France 6.395100e+01 2833687.0 640679.0 0.888 Europe 63.0
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe 80.0
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe 60.0
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia 127.0
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe 64.0
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 318.0
China 1.400000e+09 NaN NaN NaN Asia NaN
In [112]:
df['Population'] > 70
Out[112]:
Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
China              True
Name: Population, dtype: bool
In [113]:
df.loc[df['Population'] > 70]
Out[113]:
Population GDP Surface Area HDI Continent Rounded Population
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe 80.0
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia 127.0
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 318.0
China 1.400000e+09 NaN NaN NaN Asia NaN

The boolean matching is aligned on the index, so you can filter with any boolean Series, as long as its index contains the DataFrame's row labels. Column selection still works as expected:

In [114]:
df.loc[df['Population'] > 70, 'Population']
Out[114]:
Germany          8.094000e+01
Japan            1.270610e+02
United States    3.185230e+02
China            1.400000e+09
Name: Population, dtype: float64
In [115]:
df.loc[df['Population'] > 70, ['Population', 'GDP']]
Out[115]:
Population GDP
Germany 8.094000e+01 3874437.0
Japan 1.270610e+02 4602367.0
United States 3.185230e+02 17348075.0
China 1.400000e+09 NaN
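
As a minimal sketch of index-level matching (the `df_demo` frame and the hand-written `mask` are hypothetical, built only for this illustration), a boolean Series passed to loc is matched by label rather than by position, and several conditions can be combined with `&`, `|` and `~`:

```python
import pandas as pd

# Hypothetical frame, used only for illustration.
df_demo = pd.DataFrame(
    {'Population': [35.467, 80.940, 127.061, 318.523],
     'Continent': ['America', 'Europe', 'Asia', 'America']},
    index=['Canada', 'Germany', 'Japan', 'United States'],
)

# The mask lists the same labels in a different order;
# loc aligns it with the DataFrame index before filtering.
mask = pd.Series(
    [True, False, True, False],
    index=['United States', 'Japan', 'Germany', 'Canada'],
)
selected = df_demo.loc[mask]

# Each condition needs its own parentheses when combined.
big_american = df_demo.loc[
    (df_demo['Population'] > 70) & (df_demo['Continent'] == 'America')
]

# ~ negates a condition.
not_american = df_demo.loc[~(df_demo['Continent'] == 'America'), 'Population']

print(selected)
print(big_american)
print(not_american)
```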

green-divider

Operations and methods

Like NumPy arrays, DataFrames support vectorized operations and aggregation functions:

In [116]:
df.sort_values(['Population'], ascending=False, inplace=True)

df
Out[116]:
Population GDP Surface Area HDI Continent Rounded Population
China 1.400000e+09 NaN NaN NaN Asia NaN
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 318.0
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia 127.0
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe 80.0
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe 64.0
France 6.395100e+01 2833687.0 640679.0 0.888 Europe 63.0
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe 60.0
Canada 3.546700e+01 1785387.0 9984670.0 0.913 America 35.0
In [117]:
df[['Population', 'GDP']]
Out[117]:
Population GDP
China 1.400000e+09 NaN
United States 3.185230e+02 17348075.0
Japan 1.270610e+02 4602367.0
Germany 8.094000e+01 3874437.0
United Kingdom 6.451100e+01 2950039.0
France 6.395100e+01 2833687.0
Italy 6.066500e+01 2167744.0
Canada 3.546700e+01 1785387.0
In [118]:
df[['Population', 'GDP']] / 100
Out[118]:
Population GDP
China 1.400000e+07 NaN
United States 3.185230e+00 173480.75
Japan 1.270610e+00 46023.67
Germany 8.094000e-01 38744.37
United Kingdom 6.451100e-01 29500.39
France 6.395100e-01 28336.87
Italy 6.066500e-01 21677.44
Canada 3.546700e-01 17853.87
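
The division above is one case of a more general pattern. The following sketch, on a hypothetical `df_demo` frame built only for this illustration, shows an element-wise operation, a NumPy function applied to a whole DataFrame, and aggregations along each axis:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value, used only for illustration.
df_demo = pd.DataFrame(
    {'Population': [35.467, 63.951, 127.061],
     'GDP': [1785387.0, 2833687.0, np.nan]},
    index=['Canada', 'France', 'Japan'],
)

# Arithmetic with a scalar is applied to every element.
scaled = df_demo / 100

# NumPy universal functions accept a DataFrame and return a DataFrame.
logged = np.log(df_demo)

# Aggregations reduce along an axis and skip NaN by default.
col_totals = df_demo.sum()       # one value per column
row_max = df_demo.max(axis=1)    # one value per row

print(scaled)
print(col_totals)
print(row_max)
```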

Operations between a DataFrame and a Series match the Series index against the DataFrame's column labels and broadcast the operation down the rows (which can be counterintuitive).

In [119]:
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])
crisis
Out[119]:
GDP   -1000000.0
HDI         -0.3
dtype: float64
In [120]:
df[['GDP', 'HDI']] + crisis
Out[120]:
GDP HDI
China NaN NaN
United States 16348075.0 0.615
Japan 3602367.0 0.591
Germany 2874437.0 0.616
United Kingdom 1950039.0 0.607
France 1833687.0 0.588
Italy 1167744.0 0.573
Canada 785387.0 0.613
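
To make the alignment rule concrete, here is a minimal sketch on a hypothetical two-country frame (`df_demo` and `per_country` are made up for this illustration). It contrasts the default column alignment with a row-wise operation through the add method and its axis parameter:

```python
import pandas as pd

# Hypothetical frame, used only for illustration.
df_demo = pd.DataFrame(
    {'GDP': [1785387.0, 2833687.0], 'HDI': [0.913, 0.888]},
    index=['Canada', 'France'],
)

# Series labels match the column names: one adjustment per column.
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])
by_column = df_demo + crisis

# Series labels match the row labels instead. With the + operator they
# are still compared against the columns, nothing matches, and the
# result is all NaN.
per_country = pd.Series([-500_000, -100_000], index=['Canada', 'France'])
mismatched = df_demo + per_country

# The add method with axis=0 aligns on the row index.
by_row = df_demo[['GDP']].add(per_country, axis=0)

print(by_column)
print(mismatched)
print(by_row)
```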

green-divider

Creating columns from other columns

Altering a DataFrame often involves combining existing columns into a new one. For example, in our countries analysis, we could calculate the "GDP per capita", which is just GDP / Population.

In [121]:
df[['Population', 'GDP']]
Out[121]:
Population GDP
China 1.400000e+09 NaN
United States 3.185230e+02 17348075.0
Japan 1.270610e+02 4602367.0
Germany 8.094000e+01 3874437.0
United Kingdom 6.451100e+01 2950039.0
France 6.395100e+01 2833687.0
Italy 6.066500e+01 2167744.0
Canada 3.546700e+01 1785387.0

The regular pandas way of expressing that is to divide one Series by the other:

In [122]:
df['GDP'] / df['Population']
Out[122]:
China                      NaN
United States     54464.120330
Japan             36221.712406
Germany           47868.013343
United Kingdom    45729.239975
France            44310.284437
Italy             35733.025633
Canada            50339.385908
dtype: float64

The result of that operation is just another series that you can add to the original DataFrame:

In [123]:
df['GDP Per Capita'] = df['GDP'] / df['Population']
In [124]:
df
Out[124]:
Population GDP Surface Area HDI Continent Rounded Population GDP Per Capita
China 1.400000e+09 NaN NaN NaN Asia NaN NaN
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 318.0 54464.120330
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia 127.0 36221.712406
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe 80.0 47868.013343
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe 64.0 45729.239975
France 6.395100e+01 2833687.0 640679.0 0.888 Europe 63.0 44310.284437
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe 60.0 35733.025633
Canada 3.546700e+01 1785387.0 9984670.0 0.913 America 35.0 50339.385908
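
Assigning with `df[...] = ...` changes the DataFrame itself. As an alternative sketch (not what this notebook uses), the assign method returns a new DataFrame with the extra column; `df_demo` below is a hypothetical frame built only for this illustration:

```python
import pandas as pd

# Hypothetical frame, used only for illustration.
df_demo = pd.DataFrame(
    {'Population': [35.467, 63.951], 'GDP': [1785387.0, 2833687.0]},
    index=['Canada', 'France'],
)

# Column names containing spaces cannot be written as keyword arguments,
# so they are passed through a dictionary unpacked with **.
with_ratio = df_demo.assign(
    **{'GDP Per Capita': df_demo['GDP'] / df_demo['Population']}
)

print(with_ratio)
print(df_demo.columns.tolist())  # the original frame has no new column
```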

green-divider

Statistical info

You've already seen the describe method, which gives you a good "summary" of the DataFrame. Let's explore other methods in more detail:

In [125]:
df.head()
Out[125]:
Population GDP Surface Area HDI Continent Rounded Population GDP Per Capita
China 1.400000e+09 NaN NaN NaN Asia NaN NaN
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 318.0 54464.120330
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia 127.0 36221.712406
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe 80.0 47868.013343
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe 64.0 45729.239975
In [126]:
df.describe()
Out[126]:
Population GDP Surface Area HDI Rounded Population GDP Per Capita
count 8.000000e+00 7.000000e+00 7.000000e+00 7.000000 7.000000 7.000000
mean 1.750001e+08 5.080248e+06 3.061327e+06 0.900429 106.714286 44952.254576
std 4.949747e+08 5.494020e+06 4.576187e+06 0.016592 97.320286 6954.983875
min 3.546700e+01 1.785387e+06 2.424950e+05 0.873000 35.000000 35733.025633
25% 6.312950e+01 2.500716e+06 3.292250e+05 0.889500 61.500000 40265.998421
50% 7.272550e+01 2.950039e+06 3.779300e+05 0.907000 64.000000 45729.239975
75% 1.749265e+02 4.238402e+06 5.082873e+06 0.914000 103.500000 49103.699626
max 1.400000e+09 1.734808e+07 9.984670e+06 0.916000 318.000000 54464.120330
In [ ]:
population = df['Population']
In [ ]:
population.min(), population.max()
In [ ]:
population.sum()
In [ ]:
population.sum() / len(population)
In [ ]:
population.mean()
In [ ]:
population.std()
In [ ]:
population.median()
In [ ]:
population.describe()
In [ ]:
population.quantile(.25)
In [ ]:
population.quantile([.2, .4, .6, .8, 1])
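
For Population, the manual `sum() / len()` calculation and `mean()` agree because the column has no missing values. For a column with NaN, such as GDP, they would not. A minimal sketch with a hypothetical `gdp` Series shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value, used only for illustration.
gdp = pd.Series(
    [1785387.0, 2833687.0, np.nan, 4602367.0],
    index=['Canada', 'France', 'China', 'Japan'],
)

# len() counts every row, including the one holding NaN.
manual_mean = gdp.sum() / len(gdp)

# mean() skips NaN by default, so it divides by the 3 valid values.
pandas_mean = gdp.mean()

# count() returns the number of non-missing values.
count_mean = gdp.sum() / gdp.count()

# skipna=False propagates the missing value instead of skipping it.
strict_mean = gdp.mean(skipna=False)

print(manual_mean, pandas_mean, count_mean, strict_mean)
print(gdp.quantile(.5) == gdp.median())
```

The same default applies to sum, std, median and quantile, which is why describe reports a count of 7 for GDP and 8 for Population.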

purple-divider
