
Dataframes of Pandas: Introduction

This is an introductory session on Python in general and the Pandas library in particular. We are going to perform the following activities in this lecture.
1. Import a comma-separated (CSV) or tab-separated (TSV) file into Python as a DataFrame object of the Pandas library
2. Explore the DataFrame properties
3. Get subsets of the data

To download the datasets used in this tutorial, please see the following links
1. gapminder.tsv
2. pew.csv
3. billboard.csv
4. ebola.csv
5. tips.csv

Getting Started with Pandas

Change folder

In [2]:
cd "D:\Dropbox\CLASSES\Data Science for Finance\PYTHON 1"
D:\Dropbox\CLASSES\Data Science for Finance\PYTHON 1
In [3]:
import pandas as pd

Import Data as DataFrame

Import the gapminder.tsv file from the folder. This is a tab-separated file. We shall save the dataframe to df. Please note the sep='\t' argument. It is only necessary when importing a tab-separated file; when we import a CSV file, it is not needed, because read_csv defaults to a comma separator.

In [7]:
df = pd.read_csv('gapminder.tsv', sep = '\t')
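For comparison, a comma-separated file needs no sep argument, since read_csv defaults to a comma. A minimal sketch, using an inline string here instead of one of the course files so it runs anywhere:

```python
import io
import pandas as pd

# read_csv defaults to sep=',' so no sep argument is needed for CSV data
csv_text = "country,year\nAfghanistan,1952\nAfghanistan,1957\n"
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.shape)  # (2, 2)
```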

DataFrames

The df is called an object; to be specific, it is a DataFrame. To know the nature of this object:

In [8]:
type(df)
Out[8]:
pandas.core.frame.DataFrame

To know the size and shape of this dataframe. Please note that we first type the object name to select it, and then access its attribute.

In [98]:
df.shape
Out[98]:
(1704, 6)
In [11]:
# To get information about this dataframe, we can use the info() method
df.info()

RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
continent    1704 non-null object
year         1704 non-null int64
lifeExp      1704 non-null float64
pop          1704 non-null int64
gdpPercap    1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB

Read the first 5 rows of the dataframe

In [12]:
df.head()
Out[12]:
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106

INDEX : These are the row and column identifiers

To get the column names of the dataframe

In [13]:
df.columns
Out[13]:
Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

Get the row labels (index) of the dataframe.

In [15]:
df.index
Out[15]:
RangeIndex(start=0, stop=1704, step=1)

The body of the DataFrame

In [20]:
# The underlying values of the dataframe, as a NumPy array
df.values
Out[20]:
array([['Afghanistan', 'Asia', 1952, 28.801, 8425333, 779.4453145],
       ['Afghanistan', 'Asia', 1957, 30.331999999999997, 9240934,
        820.8530296],
       ['Afghanistan', 'Asia', 1962, 31.997, 10267083, 853.1007099999999],
       ...,
       ['Zimbabwe', 'Africa', 1997, 46.809, 11404948, 792.4499602999999],
       ['Zimbabwe', 'Africa', 2002, 39.989000000000004, 11926563,
        672.0386227000001],
       ['Zimbabwe', 'Africa', 2007, 43.486999999999995, 12311143,
        469.70929810000007]], dtype=object)

Know the variable types

In [22]:
df.dtypes
Out[22]:
country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

SUBSET COLUMN

In [23]:
# Slice a column from the dataframe with square brackets and the column name in quotes
# Get the country column
country = df['country']
In [24]:
country.head()
Out[24]:
0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object
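Note that selecting one column with single brackets returns a Series, while wrapping the name in a list keeps a DataFrame. A small sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'country': ['Afghanistan', 'Zimbabwe'], 'year': [1952, 2007]})
print(type(toy['country']).__name__)    # Series
print(type(toy[['country']]).__name__)  # DataFrame -- note the double brackets
```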

Drop column from dataframe

Use the drop method. The default is to drop data row-wise; since we need to drop columns, we must specify an additional option, axis='columns'.
Please note that this method does not drop the column in place by default; it returns a new dataframe and leaves df unchanged.

In [90]:
df.drop(['country'], axis='columns').head()
Out[90]:
continent year lifeExp pop gdpPercap
0 Asia 1952 28.801 8425333 779.445314
1 Asia 1957 30.332 9240934 820.853030
2 Asia 1962 31.997 10267083 853.100710
3 Asia 1967 34.020 11537966 836.197138
4 Asia 1972 36.088 13079460 739.981106
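That the original dataframe survives the drop can be checked directly; a sketch with made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'country': ['Afghanistan'], 'year': [1952], 'pop': [8425333]})
dropped = toy.drop(['country'], axis='columns')  # returns a new dataframe
print(list(toy.columns))      # ['country', 'year', 'pop'] -- original untouched
print(list(dropped.columns))  # ['year', 'pop']
```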

To drop the column in-place

To drop the column from the original dataframe, which is of course a dangerous operation, use inplace=True (a boolean, not the string 'true'). Alternatively, we can drop the column and create another dataframe by way of assignment, i.e. assign the new, reduced data to, say, newdata.

In [27]:
newdata = df.drop(['country'], axis ='columns')
In [69]:
newdata.head()
Out[69]:
continent year lifeExp pop gdpPercap
0 Asia 1952 28.801 8425333 779.445314
1 Asia 1957 30.332 9240934 820.853030
2 Asia 1962 31.997 10267083 853.100710
3 Asia 1967 34.020 11537966 836.197138
4 Asia 1972 36.088 13079460 739.981106
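A sketch of the inplace variant, run on a throw-away copy here since the operation is destructive:

```python
import pandas as pd

toy = pd.DataFrame({'country': ['Afghanistan'], 'year': [1952]})
copy = toy.copy()
copy.drop(['country'], axis='columns', inplace=True)  # returns None, mutates copy
print(list(copy.columns))  # ['year']
print(list(toy.columns))   # ['country', 'year'] -- the original is safe
```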

SELECTING rows

Rows are represented by labels, starting at 0 here. For row selection, we use the loc[] indexer; loc stands for location. Let us get the first row of the df dataframe.

In [29]:
df.loc[0]
Out[29]:
country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap        779.445
Name: 0, dtype: object
In [31]:
#To get the first two rows. This time we are using two square brackets because we are passing a python list to the loc[]
df.loc[[0,1]]
Out[31]:
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030

LOC vs ILOC

loc matches the values of the row labels, whereas iloc works by integer position. So loc[-1] will return an error, as there is no row with label -1; however, iloc[-1] works and returns the last row of the dataframe.

In [32]:
df.iloc[[0,1,-1]]
Out[32]:
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
1703 Zimbabwe Africa 2007 43.487 12311143 469.709298
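The difference can be verified on a toy frame; note that the exact KeyError message varies by pandas version:

```python
import pandas as pd

toy = pd.DataFrame({'year': [1952, 1957, 1962]})  # row labels 0, 1, 2
print(toy.iloc[-1]['year'])  # 1962 -- positional, so -1 is the last row
try:
    toy.loc[-1]
except KeyError:
    print('loc[-1] raised KeyError: no row is labelled -1')
```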

SUBSET ROWS AND COLUMNS TOGETHER

Rows and columns can be subsetted using the loc[] method. Inside loc, we use a comma to separate rows and columns: the first argument is rows and the second is columns, each wrapped in square brackets. Since loc[] matches labels, we have to write the column names. Let us get the first two rows of the year and pop columns.

In [43]:
subset = df.loc[[0,1],['year','pop']]
In [44]:
subset
Out[44]:
year pop
0 1952 8425333
1 1957 9240934

Use the iloc method for the above

Since the iloc method does not use labels but positional numbers, we write the integer positions of the year and pop columns, i.e. where they are located in the dataframe. Positions start at 0, so year is at position 2 and pop at position 4; the command would be

In [45]:
subset = df.iloc[[0,1],[2,4]]
In [46]:
subset
Out[46]:
year pop
0 1952 8425333
1 1957 9240934

Get all rows of the year and pop columns

The colon symbol is used to select all rows

In [50]:
subset2 = df.iloc[:,[2,4]]
In [52]:
subset2.head()
Out[52]:
year pop
0 1952 8425333
1 1957 9240934
2 1962 10267083
3 1967 11537966
4 1972 13079460
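The same subset can be taken with loc and column labels instead of positions; a sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'country': ['A', 'B'], 'year': [1952, 1957], 'pop': [1, 2]})
by_pos = toy.iloc[:, [1, 2]]           # positions of year and pop
by_name = toy.loc[:, ['year', 'pop']]  # same subset, by label
print(by_pos.equals(by_name))  # True
```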

Filter Data

Filter the data to show records having the country name 'United States'. We are going to filter the records in several small steps, though these can be combined in one line of code; that becomes desirable once we feel comfortable with the structure of the command. So these are the steps:

1. Get the row numbers of the dataset where the country is United States. This is called getting the index of the rows.
2. Based on the row indices, get the actual rows.

Step 1: Get the row indices

    In [72]:
    row_index = df['country']=='United States'
    

Step 2: Get the actual rows from the indices

    In [74]:
    us = df.loc[row_index]
    us.head()
    
    Out[74]:
    country continent year lifeExp pop gdpPercap
    1608 United States Americas 1952 68.44 157553000 13990.48208
    1609 United States Americas 1957 69.49 171984000 14847.12712
    1610 United States Americas 1962 70.21 186538000 16173.14586
    1611 United States Americas 1967 70.76 198712000 19530.36557
    1612 United States Americas 1972 71.34 209896000 21806.03594
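Under the hood, row_index is a boolean Series with one True/False value per row; a sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'country': ['United States', 'France', 'United States']})
row_index = toy['country'] == 'United States'
print(row_index.tolist())        # [True, False, True]
print(toy.loc[row_index].shape)  # (2, 1) -- only the True rows survive
```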

Filtering Data on Two Conditions

To break things down and simplify the process, we shall start by creating a separate dataset on the first condition, and then filter the data further in the newly created dataset using the second condition. Later, we can write the complete code in one line once we are comfortable with the notation.

Filter records that contain United States and years greater than 1975

Step 1: Create a subset of observations that contains only United States. As before, first find the row indices

    In [77]:
    row_index = df['country'] == 'United States'
    

    Get the actual rows and write them to a new dataset us

    In [79]:
    us = df.loc[row_index]
    us.head()
    
    Out[79]:
    country continent year lifeExp pop gdpPercap
    1608 United States Americas 1952 68.44 157553000 13990.48208
    1609 United States Americas 1957 69.49 171984000 14847.12712
    1610 United States Americas 1962 70.21 186538000 16173.14586
    1611 United States Americas 1967 70.76 198712000 19530.36557
    1612 United States Americas 1972 71.34 209896000 21806.03594

Step 2: Filter years in this newly created dataset. First, the row indices; since us keeps its original row labels from df, a mask built on df aligns correctly (building it as us['year'] > 1975 would be equivalent and clearer)

    In [80]:
    row_year = df['year'] > 1975
    

    Get the actual rows and write them to a new dataset us_1975

    In [83]:
    us_1975 = us.loc[row_year]
    
    In [84]:
    us_1975.head()
    
    Out[84]:
    country continent year lifeExp pop gdpPercap
    1613 United States Americas 1977 73.38 220239000 24072.63213
    1614 United States Americas 1982 74.65 232187835 25009.55914
    1615 United States Americas 1987 75.02 242803533 29884.35041
    1616 United States Americas 1992 76.09 256894189 32003.93224
    1617 United States Americas 1997 76.81 272911760 35767.43303

Filter with Two Conditions in One Line of Code

    In [88]:
    us_1975 = df[(df['country'] == 'United States') & (df['year'] > 1975)]
    
    In [89]:
    us_1975.head()
    
    Out[89]:
    country continent year lifeExp pop gdpPercap
    1613 United States Americas 1977 73.38 220239000 24072.63213
    1614 United States Americas 1982 74.65 232187835 25009.55914
    1615 United States Americas 1987 75.02 242803533 29884.35041
    1616 United States Americas 1992 76.09 256894189 32003.93224
    1617 United States Americas 1997 76.81 272911760 35767.43303
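pandas also offers the query() method as an alternative spelling of the same combined filter; a sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'country': ['United States', 'United States', 'France'],
                    'year': [1972, 1977, 1977]})
mask = toy[(toy['country'] == 'United States') & (toy['year'] > 1975)]
via_query = toy.query("country == 'United States' and year > 1975")
print(mask.equals(via_query))  # True -- same rows either way
```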

STATISTICS BY GROUPS

We use the groupby() method for finding statistics by group. For example, in this dataset, years can be considered as groups, or countries can be considered as groups.

    Find average life expectancy in each year

    In [64]:
    df.groupby('year')['lifeExp'].mean()
    
    Out[64]:
    year
    1952    49.057620
    1957    51.507401
    1962    53.609249
    1967    55.678290
    1972    57.647386
    1977    59.570157
    1982    61.533197
    1987    63.212613
    1992    64.160338
    1997    65.014676
    2002    65.694923
    2007    67.007423
    Name: lifeExp, dtype: float64

Find average life expectancy in each continent over the years

    In [95]:
    avg = df.groupby(['continent', 'year'])['lifeExp'].mean()
    avg.head(20)
    
    Out[95]:
    continent  year
    Africa     1952    39.135500
               1957    41.266346
               1962    43.319442
               1967    45.334538
               1972    47.450942
               1977    49.580423
               1982    51.592865
               1987    53.344788
               1992    53.629577
               1997    53.598269
               2002    53.325231
               2007    54.806038
    Americas   1952    53.279840
               1957    55.960280
               1962    58.398760
               1967    60.410920
               1972    62.394920
               1977    64.391560
               1982    66.228840
               1987    68.090720
    Name: lifeExp, dtype: float64

And to flatten the result back into a regular dataframe structure, we add reset_index()

    In [96]:
    avg = df.groupby(['continent', 'year'])['lifeExp'].mean().reset_index()
    
    In [97]:
    avg.head(20)
    
    Out[97]:
    continent year lifeExp
    0 Africa 1952 39.135500
    1 Africa 1957 41.266346
    2 Africa 1962 43.319442
    3 Africa 1967 45.334538
    4 Africa 1972 47.450942
    5 Africa 1977 49.580423
    6 Africa 1982 51.592865
    7 Africa 1987 53.344788
    8 Africa 1992 53.629577
    9 Africa 1997 53.598269
    10 Africa 2002 53.325231
    11 Africa 2007 54.806038
    12 Americas 1952 53.279840
    13 Americas 1957 55.960280
    14 Americas 1962 58.398760
    15 Americas 1967 60.410920
    16 Americas 1972 62.394920
    17 Americas 1977 64.391560
    18 Americas 1982 66.228840
    19 Americas 1987 68.090720
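groupby() is not limited to a single statistic; agg() can compute several at once. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'continent': ['Africa', 'Africa', 'Asia', 'Asia'],
                    'lifeExp': [39.1, 41.3, 28.8, 30.3]})
# One row per continent, with mean, min and max life expectancy side by side
stats = toy.groupby('continent')['lifeExp'].agg(['mean', 'min', 'max']).reset_index()
print(stats)
```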