DOWNLOAD DATASETS¶
To download the datasets used in this tutorial, please see the following links
1. gapminder.tsv
2. pew.csv
3. billboard.csv
4. ebola.csv
5. tips.csv
Lecture 1: Getting Started with Python and Pandas¶
Change folder¶
cd "D:\Dropbox\CLASSES\Data Science for Finance\PYTHON 1"
import pandas as pd
Import Data as DataFrame¶
Import the gapminder.tsv file from the folder. This is a tab-separated file. We shall save the dataframe to df. Please note the sep = "\t": this is only necessary when importing a tab-separated file. When we import a csv file, it is not needed.
df = pd.read_csv('gapminder.tsv', sep = '\t')
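To see why the separator matters, here is a minimal sketch using small in-memory strings instead of the data files (the tsv_text and csv_text variables are made up for illustration); with the wrong sep, pandas would read each line as a single column:

```python
import io
import pandas as pd

# The same tiny table, once tab-separated and once comma-separated
tsv_text = "country\tyear\nAfghanistan\t1952\nAlbania\t1952\n"
csv_text = "country,year\nAfghanistan,1952\nAlbania,1952\n"

# A tab-separated source needs sep='\t' ...
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep='\t')
# ... while a comma-separated source works with the default separator
csv_df = pd.read_csv(io.StringIO(csv_text))

print(tsv_df.shape)  # (2, 2)
print(csv_df.shape)  # (2, 2)
```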
DataFrames¶
The df is called an object. To be specific, it is a DataFrame. To know the nature of this object
type(df)
To know the size and shape of this dataframe. Please note that we first type the object name to select it, and then access its shape attribute
df.shape
# To get information about this dataframe, we can use the info() method
df.info()
Read the first 5 rows of the dataframe¶
df.head()
df.columns
Get the row labels (index) of the dataframe.¶
df.index
The body of the DataFrame¶
# The values attribute returns the body of the dataframe as a NumPy array
df.values
Know the variable types¶
df.dtypes
SUBSET COLUMN¶
#Slice column from the dataframe with square brackets and name of the column in quotes
# Get the country column
country = df['country']
country.head()
Drop column from dataframe¶
Use the drop method. The default is to drop rows, so we specify the additional option axis = 'columns'. This does not drop the column in place by default.
df.drop(['country'], axis='columns').head()
To drop the column in-place¶
To drop the column from the original dataframe, which is of course a dangerous operation, use inplace = True (a boolean, not the string 'true'). Alternatively, we can drop the column and create another dataframe by way of assignment, i.e. assign the new, reduced data to, say, newdata
newdata = df.drop(['country'], axis ='columns')
newdata.head()
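For illustration, a small self-contained sketch of the in-place variant, using a tiny made-up dataframe (demo) so it runs without the gapminder file:

```python
import pandas as pd

demo = pd.DataFrame({'country': ['Afghanistan', 'Albania'],
                     'year': [1952, 1952],
                     'pop': [8425333, 1282697]})

# With inplace=True, drop modifies demo itself and returns None
demo.drop(['country'], axis='columns', inplace=True)
print(demo.columns.tolist())  # ['year', 'pop']
```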
SELECTING rows¶
Rows are represented by numbers, starting at 0. For row selection, we use loc[ ] method, loc stands for location. Let us get the first row in the df dataframe.
df.loc[0]
#To get the first two rows. This time we are using two square brackets because we are passing a python list to the loc[]
df.loc[[0,1]]
LOC vs ILOC¶
loc actually matches the values of the row labels, whereas iloc works by integer position. So loc[-1] will return an error, as there is no row with label -1; however, iloc[-1] works and returns the last row of the dataframe.
df.iloc[[0,1,-1]]
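The difference is easiest to see on a dataframe whose row labels are not 0, 1, 2, ...; a small sketch with a made-up dataframe (demo) and index labels 10, 20, 30:

```python
import pandas as pd

demo = pd.DataFrame({'year': [1952, 1957, 1962]}, index=[10, 20, 30])

print(demo.loc[10])   # matches the label 10 (the first row here)
print(demo.iloc[0])   # matches position 0 (also the first row)
print(demo.iloc[-1])  # position -1: the last row

# demo.loc[-1] would raise a KeyError, because no row is labelled -1
```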
SUBSET ROWS AND COLUMNS TOGETHER¶
Rows and columns can be subsetted using the loc[] method. Inside the loc method, we use a comma to separate rows and columns: the first argument is rows and the second argument is columns, each wrapped in square brackets. Since the loc[] method matches labels, we have to write the column names. Let us get the first two rows of the year and pop columns.
subset = df.loc[[0,1],['year','pop']]
subset
Use the iloc method for the above¶
Since the iloc method does not use labels but positional numbers, to get the year and pop columns we write their index values, i.e. the positions at which they are located in the dataframe. Since year is at position 2 and pop at position 4 (counting from 0), the command would be
subset = df.iloc[[0,1],[2,4]]
subset
Get all rows for the year and pop columns¶
The colon symbol is used to get all rows
subset2 = df.iloc[:,[2,4]]
subset2.head()
Filter Data¶
Filter the rows to show only records having the country name 'United States'. We are going to filter the records in several small steps, though these can be combined in one line of code; that is desirable once we feel comfortable with the structure of the command. So these are the steps:
1. Get row indices¶
row_index = df['country']=='United States'
2. Get actual rows from the indices¶
us = df.loc[row_index]
us.head()
Two Conditions for Filtering¶
To break things down and simplify the process, we shall start by creating a separate dataset from the first condition, and then filter the newly created dataset further using the second condition. Later, we can write the complete code in one line once we are comfortable with the notation.
Filter Data that contains rows only for United States where years is greater than 1975¶
Step 1: Create a subset of observations that contains only United States
As before, first find the row indices
row_index = df['country'] == 'United States'
Get the actual rows and write them to a new dataset us
us = df.loc[row_index]
us.head()
Step 2: Filter years in this newly created dataset
First the row indices
row_year = us['year'] > 1975
Get the actual rows and write them to a new dataset us_1975
us_1975 = us.loc[row_year]
us_1975.head()
Filter with Two Conditions in one Line of Code¶
us_1975 = df[(df['country'] == 'United States') & (df['year'] > 1975)]
us_1975.head()
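Note that the parentheses around each comparison are required: & binds more tightly than == and >, so omitting them raises an error. A self-contained sketch with a tiny made-up dataframe (demo):

```python
import pandas as pd

demo = pd.DataFrame({'country': ['United States', 'United States', 'Canada'],
                     'year': [1970, 1980, 1980]})

# Each condition wrapped in parentheses, joined with & (element-wise "and")
mask = (demo['country'] == 'United States') & (demo['year'] > 1975)
print(demo[mask])  # only the 1980 United States row
```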
STATISTICS BY GROUPS¶
We use the groupby() method to compute statistics by group. For example, in this dataset, years can be considered groups, or countries can be considered groups
Find average life expectancy in each year¶
df.groupby('year')['lifeExp'].mean()
Find Average life Expectancy in each continent over years¶
avg = df.groupby(['continent', 'year'])['lifeExp'].mean()
avg.head(20)
And to reset the result back to a flat, dataframe-like structure, we add reset_index()¶
avg = df.groupby(['continent', 'year'])['lifeExp'].mean().reset_index()
avg.head(20)
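To see what reset_index() changes, a small sketch with a made-up dataframe (demo): without it, the group keys become a hierarchical index; with it, they return to being ordinary columns.

```python
import pandas as pd

demo = pd.DataFrame({'continent': ['Asia', 'Asia', 'Europe', 'Europe'],
                     'year': [1952, 1952, 1952, 1957],
                     'lifeExp': [30.0, 40.0, 60.0, 65.0]})

grouped = demo.groupby(['continent', 'year'])['lifeExp'].mean()
print(type(grouped.index))    # a MultiIndex of (continent, year) pairs

flat = grouped.reset_index()
print(flat.columns.tolist())  # ['continent', 'year', 'lifeExp']
```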