Importing Data







DOWNLOAD DATASETS

To download the datasets used in this tutorial, pleas see the following links
1. gapminder.tsv
2. pew.csv
3. billboard.csv
4. ebola.csv
5. tips.csv

Lecture 1: Getting Started with Python and Pandas

Change folder

In [2]:
cd "D:\Dropbox\CLASSES\Data Science for Finance\PYTHON 1"
D:\Dropbox\CLASSES\Data Science for Finance\PYTHON 1
In [3]:
import pandas as pd

Import Data as DataFrame

Import the gapminder.tsv file from the folder. This is a tab separated file. We shall save the dataframe to df. Please note the
sep = “\t”
. This is only necessary to add when importing a tab separated file
. When we import csv file, this will not be needed.

In [7]:
df = pd.read_csv('gapminder.tsv', sep = '\t')

DataFrames

The df is called an object. To be specific, it is called DataFrame. To know the nature of this object

In [8]:
type(df)
Out[8]:
pandas.core.frame.DataFrame

To know the size and shape of this dataframe. Please note that we first type the object to select it, and then find its attribute

In [98]:
df.shape
Out[98]:
(1704, 6)
In [11]:
# To get information about this dataframe, we can use the info() method
df.info()

RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
continent    1704 non-null object
year         1704 non-null int64
lifeExp      1704 non-null float64
pop          1704 non-null int64
gdpPercap    1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB

Read the first 5 rows of the dataframe

In [12]:
df.head()
Out[12]:
0 country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106

INDEX : These are the rows and column indentifiers

To get the column names fo the dataframe

In [13]:
df.columns
Out[13]:
Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

Get the row numbers of the dataframe.

In [15]:
df.index
Out[15]:
RangeIndex(start=0, stop=1704, step=1)

The body of the DataFrame

In [20]:
#just to know about the dataframe values
df.values
Out[20]:
array([['Afghanistan', 'Asia', 1952, 28.801, 8425333, 779.4453145],
       ['Afghanistan', 'Asia', 1957, 30.331999999999997, 9240934,
        820.8530296],
       ['Afghanistan', 'Asia', 1962, 31.997, 10267083, 853.1007099999999],
       ...,
       ['Zimbabwe', 'Africa', 1997, 46.809, 11404948, 792.4499602999999],
       ['Zimbabwe', 'Africa', 2002, 39.989000000000004, 11926563,
        672.0386227000001],
       ['Zimbabwe', 'Africa', 2007, 43.486999999999995, 12311143,
        469.70929810000007]], dtype=object)

know the variable types

In [22]:
df.dtypes
Out[22]:
country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

SUBSET COLUMN

In [23]:
#Slice column from the dataframe with square brackets and name of the column in quotes
# Get the country column
country = df['country']
In [24]:
country.head()
Out[24]:
0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object

Drop column from dataframe

Use the drop method. The default is to drop thing rowwise, so we specify additonal option of axis = ‘columns’. This does not drop the column in place by default.

In [90]:
df.drop(['country'], axis='columns').head()
Out[90]:
continent year lifeExp pop gdpPercap
0 Asia 1952 28.801 8425333 779.445314
1 Asia 1957 30.332 9240934 820.853030
2 Asia 1962 31.997 10267083 853.100710
3 Asia 1967 34.020 11537966 836.197138
4 Asia 1972 36.088 13079460 739.981106

To drop the column in-place

To drop the column from original dataframe, which is off course a dangerous operation, use the inplace = ‘true’. Alternatively, we can drop the column in place and create another dataframe by way of assignment, i.e. assign the new reduced data to say newdata

In [27]:
newdata = df.drop(['country'], axis ='columns')
In [69]:
newdata.head()
Out[69]:
continent year lifeExp pop gdpPercap
0 Asia 1952 28.801 8425333 779.445314
1 Asia 1957 30.332 9240934 820.853030
2 Asia 1962 31.997 10267083 853.100710
3 Asia 1967 34.020 11537966 836.197138
4 Asia 1972 36.088 13079460 739.981106

SELECTING rows

Rows are represented by numbers, starting at 0. For row selection, we use loc[ ] method, loc stands for location. Let us get the first row in the df dataframe.

In [29]:
df.loc[0]
Out[29]:
country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap        779.445
Name: 0, dtype: object
In [31]:
#To get the first two rows. This time we are using two square brackets because we are passing a python list to the loc[]
df.loc[[0,1]]
Out[31]:
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030

LOC vs ILOC

loc actually matches the values of the row labels, whereas iloc works like index. So if we pass a loc[-1] it will return an error as there is no row with label -1, however, iloc[-1] works, it returns the last row of the dataframe.

In [32]:
df.iloc[[0,1,-1]]
Out[32]:
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
1703 Zimbabwe Africa 2007 43.487 12311143 469.709298

SUBSET ROWS AND COLUMNS TOGETHER

Rows and columns can be subsetted using the loc[] method. Inside the loc method, we use the comma to specify rows and columns. The first argument is rows and the second argument is columns. The row and column lists are wrapped in square brackets. Since loc[] methods will try to match the lables, therefore, we have to write the column names. Let us get the first two rwos of the year and pop columns.

In [43]:
subset = df.loc[[0,1],['year','pop']]
In [44]:
subset
Out[44]:
year pop
0 1952 8425333
1 1957 9240934

Use the iloc method for the above

Since the iloc method does not use labels, rather positional numbers, to get the year and pop columns, we shall write index values, i.e. at which position are they located in the dataframe. Since year located in column 2, and pop in column 4, the command would be

In [45]:
subset = df.iloc[[0,1],[2,4]]
In [46]:
subset
Out[46]:
year pop
0 1952 8425333
1 1957 9240934

Get all rows and only first two column

The colon symbol is used to get full row

In [50]:
subset2 = df.iloc[:,[2,4]]
In [52]:
subset2.head()
Out[52]:
year pop
0 1952 8425333
1 1957 9240934
2 1962 10267083
3 1967 11537966
4 1972 13079460

Filter Data

Filtering Column and show records having the country name ‘United States’. We are going to filer the records in several small steps, though these can be combined in one line of code. But that would be desirable once we feel comfortable with the structure of the command. So these are the steps:

  • 1. Get the row numbers of the dataset where the country is United States. This is called getting index of the rows

  • 2. Based on the row indices, we then get the actual rows.

  • 1. Get row indices

    In [72]:
    row_index = df['country']=='United States'
    

    2. Get actual rows from the indices

    In [74]:
    us = df.loc[row_index]
    us.head()
    
    Out[74]:
    country continent year lifeExp pop gdpPercap
    1608 United States Americas 1952 68.44 157553000 13990.48208
    1609 United States Americas 1957 69.49 171984000 14847.12712
    1610 United States Americas 1962 70.21 186538000 16173.14586
    1611 United States Americas 1967 70.76 198712000 19530.36557
    1612 United States Americas 1972 71.34 209896000 21806.03594

    Two Conditions for Filtering

    To break things down and simplify the process, we shall start with creating a separate dataset on the first condition, and then filter further data in the newely created dataset using the second condition. Later, we can use the complete code in one sentence once we are comfortable with the data notations.

    Filter Data that contains rows only for United States where years is greater than 1975

    Step 1: Create a subset of observations that contains only United States
    As before, first find the row indices

    In [77]:
    row_index = df['country'] == 'United States'
    

    Get the actual rows and write them to a new dataset us

    In [79]:
    us = df.loc[row_index]
    us.head()
    
    Out[79]:
    country continent year lifeExp pop gdpPercap
    1608 United States Americas 1952 68.44 157553000 13990.48208
    1609 United States Americas 1957 69.49 171984000 14847.12712
    1610 United States Americas 1962 70.21 186538000 16173.14586
    1611 United States Americas 1967 70.76 198712000 19530.36557
    1612 United States Americas 1972 71.34 209896000 21806.03594

    Step 2: Filter years in this newely created dataset
    First the row indices

    In [80]:
    row_year = df['year'] > 1975
    

    Get the actual rows and write them to a new dataset us_1975

    In [83]:
    us_1975 = us.loc[row_year]
    
    In [84]:
    us_1975.head()
    
    Out[84]:
    country continent year lifeExp pop gdpPercap
    1613 United States Americas 1977 73.38 220239000 24072.63213
    1614 United States Americas 1982 74.65 232187835 25009.55914
    1615 United States Americas 1987 75.02 242803533 29884.35041
    1616 United States Americas 1992 76.09 256894189 32003.93224
    1617 United States Americas 1997 76.81 272911760 35767.43303

    Filter with Two Condition in one Line of Code

    In [88]:
    us_1975 = df[(df['country'] == 'United States') & (df['year'] > 1975)]
    
    In [89]:
    us_1975.head()
    
    Out[89]:
    country continent year lifeExp pop gdpPercap
    1613 United States Americas 1977 73.38 220239000 24072.63213
    1614 United States Americas 1982 74.65 232187835 25009.55914
    1615 United States Americas 1987 75.02 242803533 29884.35041
    1616 United States Americas 1992 76.09 256894189 32003.93224
    1617 United States Americas 1997 76.81 272911760 35767.43303

    STATISTICS BY GROUPS

    We use groupby() function for finding statistics for groups. For example, in this dataset, years can be considered as a group or countries can be considered as groups

    Find average life expectancy in each year

    In [64]:
    df.groupby('year')['lifeExp'].mean()
    
    Out[64]:
    year
    1952    49.057620
    1957    51.507401
    1962    53.609249
    1967    55.678290
    1972    57.647386
    1977    59.570157
    1982    61.533197
    1987    63.212613
    1992    64.160338
    1997    65.014676
    2002    65.694923
    2007    67.007423
    Name: lifeExp, dtype: float64

    Find Average life Expectancy in each continent over years

    In [95]:
    avg = df.groupby(['continent', 'year'])['lifeExp'].mean()
    avg.head(20)
    
    Out[95]:
    continent  year
    Africa     1952    39.135500
               1957    41.266346
               1962    43.319442
               1967    45.334538
               1972    47.450942
               1977    49.580423
               1982    51.592865
               1987    53.344788
               1992    53.629577
               1997    53.598269
               2002    53.325231
               2007    54.806038
    Americas   1952    53.279840
               1957    55.960280
               1962    58.398760
               1967    60.410920
               1972    62.394920
               1977    64.391560
               1982    66.228840
               1987    68.090720
    Name: lifeExp, dtype: float64

    And to reset the result back to flat structure like dataframe, we shall add reset_index()

    In [96]:
    avg = df.groupby(['continent', 'year'])['lifeExp'].mean().reset_index()
    
    In [97]:
    avg.head(20)
    
    Out[97]:
    continent year lifeExp
    0 Africa 1952 39.135500
    1 Africa 1957 41.266346
    2 Africa 1962 43.319442
    3 Africa 1967 45.334538
    4 Africa 1972 47.450942
    5 Africa 1977 49.580423
    6 Africa 1982 51.592865
    7 Africa 1987 53.344788
    8 Africa 1992 53.629577
    9 Africa 1997 53.598269
    10 Africa 2002 53.325231
    11 Africa 2007 54.806038
    12 Americas 1952 53.279840
    13 Americas 1957 55.960280
    14 Americas 1962 58.398760
    15 Americas 1967 60.410920
    16 Americas 1972 62.394920
    17 Americas 1977 64.391560
    18 Americas 1982 66.228840
    19 Americas 1987 68.090720
    In [ ]: