An Introduction To Statistical Regression Analysis
For some, October means playoff baseball. For others, October means it’s time for football. For even fewer, October means it’s time to start preparing for the next baseball season. Such is the life of a baseball analyst. Therefore, I’d like to kick off my 2009 analysis with an introduction to some statistical methodology I may reference over the coming months. I’d like to introduce you to a friend of mine: statistical regression analysis. Essentially, a statistical regression analysis allows us to quantify the relationship between two sets of data. In many cases, quantifying this relationship will help us predict future performance. There are two key terms you need to know in order to understand the analysis that follows.
The first is Pearson’s product-moment correlation coefficient (there will be a quiz on this later), more commonly referred to as r. r is the statistical measure of the linear correlation between two sets of data. In other words, r measures the degree to which two variables are related. A strong r value is close to 1 or -1, a weak r value is close to 0, and an r value between +/- .3 and +/- .7 is considered moderate. Therefore, if two sets of data have an r value of -.9, they are strongly associated with a negative correlation. r is calculated using the following formula, where x_i and y_i are the paired observations of X and Y, and \bar{x} and \bar{y} are their means:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
The second statistic is the square of r. The square of r is conventionally used as a measure of the strength of the association between X and Y. For example, if r squared is 0.50, then 50% of the variance of Y can be "accounted for" by the linear relationship between X and Y. r squared is especially important because two statistics can appear to have a linear relationship while that relationship accounts for only a small share of the actual variance in the data; r squared tells us how large that share is.
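For readers who prefer code to formulas, here is a minimal sketch of how r and r squared could be computed by hand. The function name `pearson_r` and the sample numbers are my own illustrations (the numbers are made up, not baseball data); a spreadsheet's built-in correlation function should give the same answer.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient for paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of the products of each pair's deviations from the means.
    sum_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator pieces: sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sum_products / math.sqrt(ss_x * ss_y)

# Made-up illustrative numbers, not real baseball data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(f"r = {r:.4f}, r squared = {r * r:.4f}")  # r = 0.7746, r squared = 0.6000
```

In this toy example, r is about .77 (a strong positive correlation) and r squared is .60, so 60% of the variance in y is accounted for by its linear relationship with x.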
Hopefully I’ve explained these concepts thoroughly enough for you to understand the following example. After I’ve demonstrated these statistics in action, we can apply the statistics to a more pressing issue. First, let’s take a look at the relationship between wins and expected wins. Expected wins is based on the formula RS^1.82/((RS^1.82)+(RA^1.82)), where RS represents runs scored and RA represents runs against. The formula yields an expected winning percentage, which is multiplied by games played to give expected wins. The formula is calculated after the games have been played; it does not predict future performance. Clearly, there should be a strong, positive correlation between wins and expected wins. This would indicate that the formula is effective. The following analysis is for the 2008 season:

As you can see from the graph, there appears to be a fairly strong, positive correlation (i.e. the data slopes upward in a relatively linear pattern). Using the statistics mentioned earlier, the relationship is quantified as:
r: .920993
r squared: .848228
Clearly there is a strong linear relationship between wins and expected wins, as quantified by an r value of approximately .92. The r squared value of approximately .85 indicates that approximately 85% of the variance in wins can be accounted for by the linear relationship with expected wins. In other words, expected wins accounted for about 85% of the team-to-team variation in 2008 wins. Note that this is a statement about variance, not a claim that the formula was "85% accurate."
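If you want to try the expected wins formula yourself, here is a minimal sketch. The function names, the 162-game default, and the sample run totals are my own assumptions for illustration; the only part taken from the formula above is the 1.82 exponent and the RS/RA ratio.

```python
def expected_win_pct(runs_scored, runs_allowed, exponent=1.82):
    """Expected winning percentage: RS^1.82 / (RS^1.82 + RA^1.82)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

def expected_wins(runs_scored, runs_allowed, games=162, exponent=1.82):
    """Scale the expected winning percentage by games played (assumed 162)."""
    return games * expected_win_pct(runs_scored, runs_allowed, exponent)

# Hypothetical team: 800 runs scored, 700 runs allowed (not a real 2008 club).
print(round(expected_wins(800, 700), 1))

# A team that scores exactly as many runs as it allows is expected to go .500.
print(expected_wins(700, 700))  # 81.0
```

Feeding each team's actual wins and expected wins into a correlation calculation is how r and r squared values like the ones above are obtained.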
The strong linear relationship in the above example was to be expected because the formula for expected wins was devised to estimate wins from runs scored and runs allowed. Now, let’s apply the same statistics to a different set of data. This time, let’s compare 2008 payrolls to 2008 wins. In this instance, an r value close to 1 would indicate that there is a strong positive correlation between the amount of money a team spends and how many wins it earns. The following graph displays the relationship between the two sets of data:

As you can see in the graph, there appears to be a slight positive correlation; however, for the most part, the data appear to be scattered at random. This seems to indicate that there is only a slight correlation between how much money a team spends and how many wins it earns. Statistically, the relationship is as follows:
r: .323286
r squared: .104514
The r value of .323286 indicates that there is a weak-to-moderate positive correlation between 2008 payrolls and 2008 wins. Furthermore, the r squared value of approximately .10 indicates that only about 10% of the variance in team wins can be accounted for by payroll. Teams such as the Rays have proven that you can win without spending a large sum of money. Clearly, other teams such as the Yankees need a crash course in statistics.
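To make "accounted for" a bit more concrete, here is a rough sketch that fits a straight line through a set of points and measures how much of the variance in y the line captures. With a single X variable, that share is the same number as r squared. The function names and the payroll and win figures are invented for illustration; they are not the 2008 data.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / ss_x
    return slope, mean_y - slope * mean_x

def share_of_variance_explained(xs, ys):
    """1 - (variation left over after the line) / (total variation in y)."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_residual / ss_total

# Invented payrolls (millions of dollars) and win totals, NOT real 2008 figures.
payroll = [45, 60, 80, 100, 140, 200]
wins = [90, 70, 85, 75, 95, 88]
slope, intercept = fit_line(payroll, wins)
print(f"wins = {intercept:.1f} + {slope:.3f} * payroll")
print(f"share of variance explained: {share_of_variance_explained(payroll, wins):.3f}")
```

A low share here means most of the spread in wins is left over after the line is drawn, which is exactly what an r squared of about .10 is telling us about payroll.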
A larger study of the historical correlation between payrolls and wins would be needed to illuminate any long-term patterns; however, for our purposes, this is quite an informative study. With the above statistics, we can begin to dissect the 2008 season and discover quite a few patterns that can help you while preparing for the 2009 fantasy baseball season.


