Exploratory analysis of BSE stock data using pandas-Part 1

Pandas is a very popular python library for data analysis. It’s really simple to get started with and offers a lot of tools for working with data. Pandas also builds on top of other libraries like matplotlib which makes it super convenient for making graphs and being able to interpret data more intuitively.

In this post, lets try to examine some data of stock prices from BSE using python and the pandas library. I’m using the Ipython python interpreter to run some tests, but eventually I create a generalized python script with functions to handle repetitive commands.

If you’re already familiar with pandas and just want to see some examples, jump here:

Datasets:

The BSE Historical Stock Prices contained information that I thought would be interesting to look at more closely. I downloaded data for three different stocks from 1st Jan 2014 to 26th Nov 2019.

Prerequisites:

Make sure you have pandas and matplotlib installed
Any python interpreter

Get the code:

You can find the scripts I wrote here

Analysis

Pandas has data structures known as series and dataframes, which makes it super handy to store data in a format that’s convenient to access. Firstly, I think we should examine the datasets that we just downloaded.

import pandas as pd df=pd.read_csv("tcsdata.csv")

I’m reading the csv data into a dataframe. Now let’s look at the columns:

df.columns

Examining the actual dataframe, we notice that the index is not the date. Changing the index to the date seems more logical since this is how we will want to access our data.

Before changing that however, if you look at the datatype of the Date column, you will notice that it is of a datatype object and not of a datatype “date”. This is easy to change.

df['Date']=pd.to_datetime(df['Date'])

Now change the index of the dataframe to the Date column.

df=df.set_index('Date')

Since I dont want to do this for every csv file Im importing, I write a function to handle this, and make this step quicker

def create_df(path): df=pd.read_csv(path) df['Date']=pd.to_datetime(df['Date']) df=df.set_index('Date') return df

Now I want to create a dataframe using the Infosys dataset from the date 1st Jan 2014 to 26th Nov 2019, and look at the High prices of the stocks. To do this, I create a Series using the High Price column of the infosys dataset with date as index. A Series is like a dataframe except it has just one column.

i_highs=pd.Series(infosysdata['High Price'])

Pandas makes it super easy to visualise this too.

All you have to do is call

i_highs.plot()

and you get a graph of the series:

Here I just look at the daily High Prices over a period of five years. Now I really want to see what the Monthly Highs look like, but only for the year 2019. For this, Im going to create another series with the High Price for 2019.

i_high_19=i_high.loc[i_high.index>='2019-01-01']

.loc is used to return the entire rows from the series.

Now, I will loop over this Series and return a series of each month, then find the highest value for each month

you can use .month on a Date object to return the month of that date.

a=i_high_19[i_high_19.index.month==1]

will return a subset of i_high_19 where the month is 1.

Pandas comes with a function .max() which easily helps me find the maximum value from the Series.

a.max()

Now, I can loop through the Series, and do the same thing for all months.

for i in range(1,11): high=(i_high_19.loc[i_high_19.index.month==i].max()) a=a.append(i_high_19[(i_high_19==high)&(i_high_19.index.month==i)])

We can plot the series:

Creating generalized functions

Since I want to work with a lot of datasets and create many different series to look at, its going to get pretty tedious creating the dataframe, setting the index, choosing columns etc, so I decided to write some generalized functions which could help me do this faster

You can find the code here.

get_data.py contains some different functions:

def create_df(path):

This takes in the path name of the csv file and returns a dataframe with the index set to the date column

def get_sub(start_date,end_date,df):

This function takes in a dataframe and creates a subset of it based on the start and end dates.

def get_periods(period,df):

This takes in a string “month” or “year” and a dataframe and returns the available years or months in that dataframe

def do_action(period,column,action,df):

This function returns a series and takes the following parameters:

period-a string “month” specifying that you want to work with the monthly data of a year or “year” specifying that you want to work with the yearly data over the span of a few years. The string “daily” just returns all the columns.
column-A string that specifies which column name of the dataframe you want to work with.
action-This can either be “max”,”min” or na. A period of “month” and an action of “max” will return the monthly highs, while a period of “year” and an action of “max” will return the yearly highs. “na” does nothing and instead returns all the columns as is
df-This is the dataframe you want to use

Feel free to add/modify the code if you need to. For example, if you need an average function, add it as a switch in the do_action method.

The file plt_data.py contains only one function for now:

def simple(legend,*sers):

This function takes in one or more series and creates a simple line chart. It also takes an array of strings [legend] specifying the corresponding label for each series.

Test Cases:

Here are some examples of how I used the functions and played with the data to create some graphs

Yearly highs and Yearly lows for the TCS stock data for a period of five years:

First, create the dataframe:

tcsdata=get_data.create_df("Downloads/csvs/tcsdataset.csv")

Now create a series for yearly highs:

tcs_year_highs=get_data.do_action("year","High Price","max",tcsdata)

And a series for yearly lows:

tcs_year_lows=get_data.do_action("year","Low Price","min",tcsdata)

Now plot both series:

plt_data.simple(["tcs-high","tcs-low"],tcs_year_highs,tcs_year_lows)

2-Comparison of yearly highs of Infosys and TCS for a period of five years:

Let us create a dataframe from the Infosys dataset:

infosysdata=get_data.create_df("Downloads/csvs/infosysdataset.csv")

Find the yearly highs from this dataframe:

infosys_year_highs=get_data.do_action("year","High Price","max",infosysdata)

Now plot the tcs_year_highs and infosys_year_highs series:

plt_data.simple(["tcs","infy"],tcs_year_highs,infosys_year_highs)

3-Daily high, low and closing price for Wipro stock with daily prices over a one year period:

As before, create a dataframe for Wipro:

wiprodata=get_data.create_df("Downloads/csvs/wiprodataset.csv")

Lets work with the data for the year 2017:

wiprodata17=get_data.get_sub('2017-01-01','2017-12-31',wiprodata)

Now, create different series for high price,low price and closing price:

wiprodata17_high=wiprodata17['High Price'] wiprodata17_low=wiprodata17['Low Price'] wiprodata17_close=wiprodata17['Close Price']

Now just plot all three Series:

plt_data.simple(["wipro-high","wipro-low","wipro-close"],wiprodata17_high,wiprodata17_low,wiprodata17_close)

Closing thoughts

Its awesome how easy Pandas is to get started with and quickly create graphs and manipulate data. For further reading, I definitely recommend Learning Pandas by Michael Heydt. In the next post, I am going to try plotting different kinds of graphs and maybe adding more functionality to the program. I hope you enjoyed reading this and for any suggestions/feedback, please drop a comment below 🙂