Pandas DataFrames: Introduction¶
This is an introductory session on Python in general and the Pandas library in particular. We are going to perform the following activities in this lecture.
1. Import a comma-separated (CSV) or tab-separated (TSV) file into Python as a DataFrame object of the Pandas library
2. Explore the DataFrame's properties
3. Get subsets of the data
To download the datasets used in this tutorial, please see the following links:
1. gapminder.tsv
2. pew.csv
3. billboard.csv
4. ebola.csv
5. tips.csv
Getting Started with Pandas¶
Change folder¶
cd "D:\Dropbox\CLASSES\Data Science for Finance\PYTHON 1"
import pandas as pd
Import Data as DataFrame¶
Import the gapminder.tsv file from the folder. This is a tab-separated file. We shall save the dataframe to df. Please note the sep='\t' argument: it is only needed when importing a tab-separated file. When we import a CSV file, it is not required.
df = pd.read_csv('gapminder.tsv', sep = '\t')
DataFrames¶
df is an object; to be specific, it is a DataFrame. To know the nature of this object:
type(df)
To know the size and shape of this dataframe, we use its shape attribute. Please note that we first type the object, then its attribute. Since shape is an attribute, not a method, no parentheses are needed.
df.shape
# To get information about this dataframe, we can use the info() method
df.info()
Read the first 5 rows of the dataframe¶
df.head()
# Get the column names
df.columns
Get the row labels of the dataframe¶
df.index
The body of the DataFrame¶
# The body (values) of the dataframe as a NumPy array
df.values
Know the variable types¶
df.dtypes
SUBSET COLUMN¶
# Slice a column from the dataframe with square brackets and the name of the column in quotes
# Get the country column
country = df['country']
country.head()
Drop column from dataframe¶
Use the drop() method. The default is to drop data row-wise. Since we need to drop columns,
we need to specify an additional option, axis='columns'.
Please note that this method does not drop the column in place by default.
df.drop(['country'], axis='columns').head()
To drop the column in-place¶
To drop the column from the original dataframe, which is of course a dangerous operation, use inplace=True (a boolean, not the string 'true'). Alternatively, we can drop the column and create another dataframe by way of assignment, i.e. assign the new, reduced data to, say, newdata
newdata = df.drop(['country'], axis ='columns')
newdata.head()
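As a minimal sketch on a toy frame (the data here is illustrative, not the gapminder file), inplace=True makes drop() modify the frame itself and return None:

```python
import pandas as pd

# Toy frame (illustrative data). inplace=True takes a boolean, not
# the string 'true', and makes drop() modify the frame in place.
toy = pd.DataFrame({'country': ['Afghanistan', 'Albania'],
                    'year': [1952, 1952]})
toy.drop(['country'], axis='columns', inplace=True)
print(toy.columns.tolist())  # ['year']
```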
SELECTING rows¶
Rows are represented by numbers, starting at 0. For row selection, we use the loc[] indexer; loc stands for location. Let us get the first row in the df dataframe.
df.loc[0]
# To get the first two rows. This time we use two pairs of square brackets because we are passing a Python list to loc[]
df.loc[[0,1]]
LOC vs ILOC¶
loc matches the values of the row labels, whereas iloc works by integer position, like a list index. So loc[-1] returns an error because there is no row with the label -1; however, iloc[-1] works and returns the last row of the dataframe.
df.iloc[[0,1,-1]]
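The label-vs-position distinction is easiest to see on a toy frame with non-default row labels (the frame below is illustrative, not the gapminder data):

```python
import pandas as pd

# Toy frame with non-default row labels to make the
# label-vs-position distinction visible.
toy = pd.DataFrame({'x': [10, 20, 30]}, index=[100, 101, 102])

first = toy.loc[100]   # matches the row *label* 100
last = toy.iloc[-1]    # positional: the last row

# loc[-1] fails because no row carries the label -1
try:
    toy.loc[-1]
except KeyError:
    print('KeyError: no row labelled -1')
```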
SUBSET ROWS AND COLUMNS TOGETHER¶
Rows and columns can be subsetted together using the loc[] indexer. Inside loc[], a comma separates rows from columns: the first argument is the rows and the second is the columns. The row and column lists are wrapped in square brackets. Since loc[] matches labels, we have to write the column names. Let us get the first two rows of the year and pop columns.
subset = df.loc[[0,1],['year','pop']]
subset
Use the iloc method for the above¶
Since the iloc method does not use labels but positional numbers, we write the index values of the year and pop columns, i.e. the positions at which they are located in the dataframe. Since year is located at position 2 and pop at position 4 (counting from 0), the command would be
subset = df.iloc[[0,1],[2,4]]
subset
Get all rows for the year and pop columns¶
The colon symbol is used to select all rows
subset2 = df.iloc[:,[2,4]]
subset2.head()
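On a toy frame (illustrative names), the colon selects everything along an axis, and iloc also accepts positional ranges:

```python
import pandas as pd

# Toy frame (illustrative). The colon means "everything along
# this axis"; iloc also accepts positional slices like 0:2.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

all_rows = toy.iloc[:, [0, 2]]   # every row, columns at positions 0 and 2
first_two = toy.iloc[0:2, :]     # rows 0 and 1, every column
```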
Filter Data¶
Filter the rows to show records having the country name 'United States'. We are going to filter the records in several small steps, though these can be combined into one line of code; doing so becomes desirable once we feel comfortable with the structure of the command. These are the steps:
1. Get the row indices (a boolean mask)¶
row_index = df['country']=='United States'
2. Get actual rows from the indices¶
us = df.loc[row_index]
us.head()
Filtering Data on Two Conditions¶
To break things down and simplify the process, we start by creating a separate dataset from the first condition, and then filter the newly created dataset further using the second condition. Later, once we are comfortable with the notation, we can combine the complete code into one line.
Filter records that contain United States and years greater than 1975¶
Step 1: Create a subset of observations that contains only the United States. As before, first find the row indices
row_index = df['country'] == 'United States'
Get the actual rows and write them to a new dataset us
us = df.loc[row_index]
us.head()
Step 2: Filter years in this newly created dataset. First the row indices; note that we build the mask from us (not df) so that it lines up with the rows of us
row_year = us['year'] > 1975
Get the actual rows and write them to a new dataset us_1975
us_1975 = us.loc[row_year]
us_1975.head()
Filter with Two Conditions in One Line of Code¶
us_1975 = df[(df['country'] == 'United States') & (df['year'] > 1975)]
us_1975.head()
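A minimal sketch of the same pattern on toy data (illustrative values) shows why the parentheses around each condition matter: & binds more tightly than == and > in Python, so omitting them raises an error.

```python
import pandas as pd

# Toy data (illustrative). Parentheses around each condition are
# required because & binds more tightly than == and >.
toy = pd.DataFrame({'country': ['United States', 'United States', 'Canada'],
                    'year': [1952, 2002, 2002]})

mask = (toy['country'] == 'United States') & (toy['year'] > 1975)
result = toy[mask]   # only the 2002 United States row survives
```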
STATISTICS BY GROUPS¶
We use the groupby() method to find statistics for groups. For example, in this dataset, the years can be considered groups, or the countries can be considered groups.
Find average life expectancy in each year¶
df.groupby('year')['lifeExp'].mean()
Find Average Life Expectancy in Each Continent over the Years¶
avg = df.groupby(['continent', 'year'])['lifeExp'].mean()
avg.head(20)
To reset the result back to a flat, dataframe-like structure, we add reset_index()¶
avg = df.groupby(['continent', 'year'])['lifeExp'].mean().reset_index()
avg.head(20)
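The contrast between the two shapes can be seen on toy data (illustrative values): without reset_index() the result is a Series with a two-level MultiIndex; with it, a flat DataFrame.

```python
import pandas as pd

# Toy data (illustrative). groupby on two keys yields a Series with
# a MultiIndex; reset_index() turns the index levels into columns.
toy = pd.DataFrame({'continent': ['Asia', 'Asia', 'Europe'],
                    'year': [1952, 1952, 1952],
                    'lifeExp': [30.0, 40.0, 60.0]})

stacked = toy.groupby(['continent', 'year'])['lifeExp'].mean()
flat = stacked.reset_index()   # continent and year become ordinary columns
```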