Data Cleaning¶
In [18]:
import pandas as pd
In [19]:
data = pd.read_csv('https://opendoors.pk/wp-content/uploads/2020/02/prices.csv')
In [20]:
data.head()
Out[20]:
In [21]:
data.info()
The Close column is stored as object (string), so convert it to numeric¶
In [22]:
# Slice the Close column to make a separate series
close = data['Close']
In [23]:
# See the second value of close (index 1)
close[1]
Out[23]:
Remove the undesirable string \xa0¶
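Before removing it, it can help to see what `\xa0` actually is. The sketch below uses a hypothetical sample value (the real entries in `data['Close']` may look different) to show that `\xa0` is a non-breaking space, which Python displays as an escape in the `repr` of a string, and that removing it together with the regular space that follows leaves text that `float()` can parse.

```python
import unicodedata

# Hypothetical sample value; the real entries in data['Close'] may differ.
raw = '\xa0 12.5'

print(repr(raw))                 # the repr shows the \xa0 escape explicitly
print(unicodedata.name(raw[0]))  # official Unicode name of the first character

# Remove the non-breaking space and the regular space that follows it
cleaned = raw.replace('\xa0 ', '')
value = float(cleaned)
print(value)
```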
1. The Python list way¶
In [24]:
# First convert the Series to a Python list
plist = close.tolist()
In [25]:
# Count the elements in plist
N = len(plist)
N
Out[25]:
Manually remove the unwanted characters from the first 5 elements of plist¶
In [26]:
plist[0] = plist[0].replace('\xa0 ', '')
In [27]:
plist[1] = plist[1].replace('\xa0 ', '')
plist[2] = plist[2].replace('\xa0 ', '')
plist[3] = plist[3].replace('\xa0 ', '')
plist[4] = plist[4].replace('\xa0 ', '')
We can use a loop¶
In [28]:
for i in range(N):
    # Remove the unwanted characters
    plist[i] = plist[i].replace('\xa0 ', '')
    # Convert string to float
    plist[i] = float(plist[i])
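The same loop can also be sketched as a list comprehension that builds a new list instead of overwriting `plist` in place. The sample values and the helper name `clean_price` are hypothetical, chosen only for illustration.

```python
# Hypothetical sample values standing in for close.tolist()
raw_prices = ['\xa0 101.5', '\xa0 99.75', '\xa0 100.0']

def clean_price(text):
    # Remove the non-breaking space plus the regular space, then convert to float
    return float(text.replace('\xa0 ', ''))

# Build a new list; raw_prices itself is left unchanged
prices = [clean_price(p) for p in raw_prices]
print(prices)
```

One practical difference: the in-place loop cannot be run twice, because after the first run the elements are floats and floats have no `replace` method. The comprehension leaves the original strings untouched, so the cell can be rerun safely.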
In [30]:
# Show 9 elements (indices 1 to 9) to check that the values are proper floats
plist[1:10]
Out[30]:
Convert the list back to a pandas Series¶
In [31]:
close = pd.Series(plist)
In [34]:
close.head()
Out[34]:
2. The pandas way¶
The code fragments above are a good Python exercise. However, there is a better and easier strategy using pandas' built-in str.split() method.
In [35]:
close2 = data['Close']
In [81]:
splitted = close2.str.split('\xa0 ', expand = True)
In [84]:
splitted.info()
In [92]:
# drop column 0 as it has no data
splitted.drop([0], axis = 'columns', inplace = True)
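Why column 0 carries no data can be seen on a small example. The sketch below assumes, as the split result above suggests, that each value starts with the separator `'\xa0 '`; the sample values are hypothetical. Splitting on a separator that sits at the very start of a string yields an empty piece before it, and that empty piece becomes column 0.

```python
import pandas as pd

# Hypothetical sample values; the real column may differ.
sample = pd.Series(['\xa0 101.5', '\xa0 99.75'])

parts = sample.str.split('\xa0 ', expand=True)
print(parts)

# Column 0 holds the empty text before the separator,
# column 1 holds the number as text
print(parts[0].tolist())
print(parts[1].tolist())
```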
In [91]:
splitted.head()
Out[91]:
In [93]:
splitted.info()
In [96]:
# Convert the remaining column from string (object) to float
splitted = splitted.astype(float)
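As an alternative sketch, not the approach used in this notebook, the split, drop, and convert steps can be collapsed into one expression with `str.replace` and `pd.to_numeric`. The sample values are hypothetical, and the last one is deliberately unparseable to show what `errors='coerce'` does.

```python
import pandas as pd

# Hypothetical sample values; the last entry is deliberately not a number.
sample = pd.Series(['\xa0 101.5', '\xa0 99.75', '\xa0 n/a'])

cleaned = pd.to_numeric(
    sample.str.replace('\xa0 ', '', regex=False),  # literal replacement, no regex
    errors='coerce',  # unparseable entries become NaN instead of raising
)
print(cleaned)
print(cleaned.isna().sum())  # number of entries that could not be converted
```

This keeps the result as a Series rather than a one-column DataFrame, and counting the NaN values afterwards gives a quick check for rows that did not convert.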
In [98]:
splitted.describe()
Out[98]: