Examining The Correlation of Major U.S. Indices

The S&P 500 Index:

The Standard & Poor's 500 Index (S&P 500) consists of 500 of the largest U.S. publicly traded companies. It is widely used as a benchmark index by U.S. investors and is arguably one of the most popular indices in the world.

Correlation measures how strongly two variables move together. In our usage, the correlation coefficient will always fall between -1 and 1. There are different types of correlation coefficients, including:

  • Sample Correlation Coefficient
  • Population Correlation Coefficient
  • Pearson Correlation Coefficient

To outline the immediate differences between each, I put together a sample data set and compared them. Note that correl refers to the built-in CORREL() function within Excel. If you have an entire population or a large subset of a population, you would use a population-based formula. If you only have a small sample from a dataset, you would estimate the population standard deviation from the sample. For large data sets, the discrepancies are small.

We will be using the Pearson Correlation Coefficient. In general, a value of -1 indicates that the assets are perfectly negatively correlated: when one goes up, the other goes down. A value of 0 indicates no linear relationship between the two. A correlation coefficient of 1 indicates that the assets are perfectly positively correlated: when one moves up, the other does too (they move in tandem).
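As a rough illustration of the coefficient's range, here is a minimal sketch that computes the Pearson coefficient from its definition (the sum of the products of deviations from the mean, divided by the square root of the product of the summed squared deviations). The function name `pearson` and the three toy series are made up for this example and are not from the original script.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient computed straight from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Toy series (illustrative values only)
up = [1.0, 2.0, 3.0, 4.0, 5.0]
same_direction = [2.0, 4.0, 6.0, 8.0, 10.0]
opposite_direction = [5.0, 4.0, 3.0, 2.0, 1.0]

r_same = pearson(up, same_direction)          # moves in tandem
r_opposite = pearson(up, opposite_direction)  # moves inversely
print(r_same, r_opposite)
```

The result can be cross-checked against `np.corrcoef(up, same_direction)[0, 1]`, which computes the same quantity.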


We can graph correlation in a few different ways in order to visualize it. I made a post about doing this in Excel where we conditionally formatted correlation coefficients. A value of -1 would be colored green to indicate a perfectly inverse relationship. A value of 0 would be colored yellow to indicate no discernible relationship. A value of 1 would be colored red to indicate high correlation. Theoretically, if we saw more green and yellow in the matrix, it would imply the index components largely do not move together. A completely red matrix would imply the assets are all correlated. We could implement the conditional formatting by hand in Python, but why make more work for ourselves when we have matplotlib to work with! We will use a matplotlib-based heatmap to make our lives super easy.

To start, we will create a class where we can utilize different methods as needed. We will initialize a few variables at first, including a start date, an end date, and an analysis type, which will be either returns or prices. We will choose a time frame representative of a run in the market. This will correspond to January 29, 2016 to October 4, 2017. The modules required for this script to run are also displayed here.
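For readers following along without the original listing, a class along these lines might look like the sketch below. The class name `CorrelationAnalysis`, its attribute names, and the date string format are hypothetical choices for illustration and are not taken from the original script.

```python
class CorrelationAnalysis:
    """Hypothetical container for the settings described above."""

    VALID_TYPES = ("returns", "prices")

    def __init__(self, start_date, end_date, analysis_type="returns"):
        # Reject anything other than the two analysis types mentioned in the post
        if analysis_type not in self.VALID_TYPES:
            raise ValueError(
                f"analysis_type must be one of {self.VALID_TYPES}, got {analysis_type!r}"
            )
        self.start_date = start_date
        self.end_date = end_date
        self.analysis_type = analysis_type


# The market run used in this post (date strings in an assumed ISO format)
analysis = CorrelationAnalysis("2016-01-29", "2017-10-04", "returns")
print(analysis.start_date, analysis.end_date, analysis.analysis_type)
```

Validating the analysis type up front means a typo fails immediately rather than later, when the data is being processed.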

We then need to gather tickers for an index. You can find CSVs on the Internet, but I prefer to web scrape them in order to make sure the ticker lists are continually updated over time. All we need to do here is paste in the desired Wikipedia URL. If the stock data is organized within a table, then this function will extract the tickers from it.
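One way to sketch the table-extraction step, using only the standard library, is shown below. It assumes the tickers sit in the first data cell of each row of the first table on the page. The names `FirstColumnParser` and `extract_tickers` and the sample HTML are hypothetical, and downloading the page itself is left out so the example runs offline.

```python
from html.parser import HTMLParser


class FirstColumnParser(HTMLParser):
    """Collect the text of the first <td> in each row of the first <table>."""

    def __init__(self):
        super().__init__()
        self.tables_seen = 0       # number of <table> tags opened so far
        self.cell_count = 0        # <td> cells seen in the current row
        self.in_first_cell = False
        self.tickers = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables_seen += 1
        elif tag == "tr":
            self.cell_count = 0
        elif tag == "td" and self.tables_seen == 1:
            self.cell_count += 1
            if self.cell_count == 1:
                self.in_first_cell = True
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self.in_first_cell:
            text = "".join(self._buffer).strip()
            if text:
                self.tickers.append(text)
            self.in_first_cell = False

    def handle_data(self, data):
        if self.in_first_cell:
            self._buffer.append(data)


def extract_tickers(html):
    """Return the first-column values of the first table in an HTML string."""
    parser = FirstColumnParser()
    parser.feed(html)
    return parser.tickers


# Made-up HTML standing in for a downloaded constituents page
sample_html = """
<table>
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td><a href="#">MMM</a></td><td>3M</td></tr>
  <tr><td><a href="#">ABT</a></td><td>Abbott</td></tr>
</table>
<table><tr><td>IGNORED</td></tr></table>
"""
tickers = extract_tickers(sample_html)
print(tickers)
```

If pandas and an HTML parser such as lxml are installed, `pandas.read_html` can read the tables on a page into DataFrames in a single call, which is usually the shorter route.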

Next, we need to convert our input dates to datetime objects to work with the pandas datareader in order to pull prices from Quandl.
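A minimal sketch of that conversion, assuming the dates are entered as MM/DD/YYYY strings (the input format and the helper name `to_datetime` are assumptions, not taken from the original script):

```python
from datetime import datetime


def to_datetime(text, fmt="%m/%d/%Y"):
    """Parse a date string into a datetime object (format is an assumption)."""
    return datetime.strptime(text, fmt)


# The date range used in this post
start = to_datetime("01/29/2016")
end = to_datetime("10/04/2017")
print(start, end)
```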

Next, we can gather data for each stock using a for loop and the pandas web datareader. The datareader allows us to access historical data from sources like Google Finance and Yahoo Finance. Keep in mind the Yahoo API is defunct, so we will be using an alternative data source: Quandl. Quandl is an awesome website which I highly encourage you all to check out. We will also create directories for the data using the os module. Once we have this data, we can calculate the daily changes easily with the pct_change() function. We will store the respective data in the proper directories.
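The directory and daily-change steps can be sketched as follows. To keep the example self-contained it uses a small made-up price series instead of a download, and the helper name `save_prices_and_returns`, the ticker "ABC", and the `prices`/`returns` folder layout are all hypothetical.

```python
import os
import tempfile

import pandas as pd


def save_prices_and_returns(ticker, prices, base_dir):
    """Write prices and daily percentage changes to separate sub-directories."""
    price_dir = os.path.join(base_dir, "prices")
    return_dir = os.path.join(base_dir, "returns")
    os.makedirs(price_dir, exist_ok=True)   # create folders if they do not exist
    os.makedirs(return_dir, exist_ok=True)

    # pct_change() leaves NaN in the first row (no prior price), so drop it
    returns = prices.pct_change().dropna()

    prices.to_csv(os.path.join(price_dir, f"{ticker}.csv"), header=True)
    returns.to_csv(os.path.join(return_dir, f"{ticker}.csv"), header=True)
    return returns


# Made-up closing prices on four business days
dates = pd.date_range("2016-01-29", periods=4, freq="B")
prices = pd.Series([100.0, 102.0, 99.96, 104.958], index=dates, name="Close")

base_dir = tempfile.mkdtemp()  # throwaway location for the example
returns = save_prices_and_returns("ABC", prices, base_dir)
print(returns)
```

In the real script the same helper would be called once per ticker inside the download loop.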

The next step is to collate the data. We can do this in a few ways. We could join the columns to a blank data frame, or append them.
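As a sketch of the collation step, `pd.concat` can line up one series per ticker into a single data frame, aligning on dates. The tickers and return values below are made up for illustration.

```python
import pandas as pd

dates = pd.date_range("2016-01-29", periods=3, freq="B")

# One series of daily returns per ticker (illustrative values only)
per_ticker = {
    "AAA": pd.Series([0.01, -0.02, 0.03], index=dates),
    "BBB": pd.Series([0.00, 0.01, -0.01], index=dates),
    "CCC": pd.Series([0.02, 0.02], index=dates[:2]),  # one day missing
}

# axis=1 puts each ticker in its own column; dates are aligned automatically
# and days missing for a ticker are filled with NaN
combined = pd.concat(per_ticker, axis=1)
print(combined)
```

Because the alignment is on the index, a ticker with a shorter history does not shift the other columns out of place.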

The next step is to calculate correlation and display our output. We can create a correlation matrix for an entire dataframe with the dataframe's corr() method (df.corr()). To graph this, we could build the heatmap directly in matplotlib, but I figured I would try out another library called seaborn. It has its own built-in heatmap function with some cool parameters. We need to modify some aspects of the chart, however. The xticks and yticks of the chart will correspond to the column headers in our dataframe, as it is a matrix. The following lines of code will help space our ticks out and align them to the center of the grid spaces. A cool argument of seaborn’s heatmap is the ability to annotate the grid.
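A minimal sketch of this step, assuming seaborn and matplotlib are installed, might look like the following. The random returns, the ticker names, and the `RdYlGn_r` color map (chosen to echo the green/yellow/red scheme from the Excel post) are assumptions for illustration, not the original code.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: draw off-screen; remove for an interactive window

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random daily returns standing in for real data
rng = np.random.default_rng(0)
returns = pd.DataFrame(
    rng.normal(0, 0.01, size=(250, 4)), columns=["AAA", "BBB", "CCC", "DDD"]
)

corr = returns.corr()  # pairwise Pearson correlation matrix

fig, ax = plt.subplots(figsize=(6, 5))
# annot=True writes each coefficient into its grid cell
sns.heatmap(corr, annot=True, fmt=".2f", cmap="RdYlGn_r", vmin=-1, vmax=1, ax=ax)

# Place one tick at the center of each grid cell and label it with the ticker
ax.set_xticks(np.arange(len(corr.columns)) + 0.5)
ax.set_xticklabels(corr.columns)
ax.set_yticks(np.arange(len(corr.index)) + 0.5)
ax.set_yticklabels(corr.index)
fig.tight_layout()
```

With an interactive backend, `plt.show()` would open the chart window. For hundreds of tickers, annotation is best turned off, since the numbers become unreadable.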

Let’s run our script and check out the output using the matplotlib heatmap.

As you can see it’s a bit messy since there are 505 assets in this matrix. We can actually zoom in using the magnifying glass icon in the matplotlib window.

What I wanted to do next is graphically compare times of turmoil within the market to the times of rapid expansion which we previously graphed. It is already well understood that diversification offers little protection in recessionary time spans. From that, we would expect much less price correlation in times of expansion. What I wanted to do here was test that statement numerically. We can also look at the strongest relationships throughout each time period, and output a simple average to give us an idea of just how correlated the market was in general.
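One way to sketch the "simple average" and "strongest relationships" calculation is to look only at the cells above the diagonal of the correlation matrix, so each pair is counted once and the self-correlations of 1 are excluded. The function name `summarize_correlation` and the toy data are hypothetical, and "strongest" is taken here to mean the highest positive coefficients.

```python
import numpy as np
import pandas as pd


def summarize_correlation(corr, top_n=3):
    """Return the average pairwise correlation and the top_n highest pairs."""
    # Boolean mask that is True strictly above the diagonal
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Keep one entry per pair, indexed by (ticker, ticker)
    pairs = corr.where(mask).stack().dropna()
    average = float(pairs.mean())
    strongest = pairs.sort_values(ascending=False).head(top_n)
    return average, strongest


# Toy data: a and b move together, c moves against both
data = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0], "c": [4.0, 3.0, 2.0, 1.0]}
)
average, strongest = summarize_correlation(data.corr(), top_n=1)
print(average)
print(strongest)
```

Running the same function on the expansion window and on a downturn window gives two averages that can be compared directly.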

Disclaimer: don’t try doing something like this for an index like the S&P 500 in Excel; your computer will crash. We can do this without any problems in Python.

Takeaway: Just from looking at both charts, we can clearly see that when the market turned sour, the constituents were far more correlated. We could also test multiple economic downturns and graph each to see the extent to which the underlying assets were correlated.

We can also break this down by sector. There are 11 sectors under the GICS (Global Industry Classification Standard) classification. This is a simplified and logical sector and industry classification standard which I use. Websites such as Bloomberg and Fidelity use the GICS standard also. The issue here will be getting the data for the sectors. Sources like Bloomberg make it easy, but I really don’t feel like paying $20,000+ for data. You could try examining this through representative ETFs, downloading the sector data (XLE, XLF, etc.) via Yahoo Finance.

What about other major indices? How about the Dow? Or what about the NASDAQ? The Dow Jones Industrial Average is a price-weighted index consisting of 30 significant equity securities. As the word "significant" suggests, the constituents are chosen for their prominence, so there is no clear sector or industry direction to adhere to. We can hypothesize that the NASDAQ will be largely correlated due to the overwhelming number of technology constituents representing it. The NASDAQ index has been absolutely killing it lately. It is outperforming nearly every major U.S. index year to date.

Just for kicks, let’s see the year-to-date correlations for each index and output them graphically with the seaborn module.



A few things to keep in mind here: correlation is time specific, and these relationships change over time. This is a major flaw of modern portfolio theory, which is predicated on selecting uncorrelated assets.
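To make the time dependence concrete, here is a small sketch using a rolling correlation window in pandas. The two series are synthetic: the second one copies the first for 100 observations and then mirrors it, so the relationship flips halfway through. The 20-observation window is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic daily returns for one asset
x = pd.Series(rng.normal(0, 0.01, size=200))

# A second asset that tracks x for 100 days, then moves exactly opposite
y = pd.concat([x.iloc[:100], -x.iloc[100:]])

# Correlation recomputed over each trailing 20-observation window
rolling_corr = x.rolling(window=20).corr(y)

print(rolling_corr.iloc[99])   # window lies entirely in the first regime
print(rolling_corr.iloc[199])  # window lies entirely in the second regime
```

A single coefficient computed over all 200 observations would hide the regime change that the rolling series makes visible.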


About the author


Hi, I'm Frank. I have a passion for coding and extend it primarily within the realm of Finance.



  • Hello Frank. I tried to run many of your scripts, and all worked fine, with the exception of this one for correlations of indicates. I copied and pasted all the code into Python 3.6 (Spyder). I’m getting this error: AttributeError: ‘correl’ object has no attribute ‘time_entry’

    Any idea what may be going on?
