The p-value and its limitations in decision making

Artemas Wang
3 min read · Dec 21, 2020


Whenever I use Excel, R, or any other software package to analyze data, one of the key outputs is the p-value. This value, short for probability value, tells me how likely it is to observe a result at least as extreme as the one in my data if the null hypothesis is true. But what is it really, and how can it be used in decision-making? In this post, I’ll discuss how I use the p-value in data analysis, along with its benefits and limitations when making decisions.

To examine the p-value in a real-life scenario, I used a sample data set from Kaggle. The data represents the total number of Citibike rentals in New York City in a given calendar year. It includes variables such as the month and season in which people rented the Citibikes, the time of day, and the temperature at the time of rental. A snippet of the data can be viewed below:

Sample Kaggle Data

Based on the data sample above, there are many questions I could ask of the data. Some questions might be:

  • Does a relationship exist between temperature and the number of bike rentals?
  • Does the time of day affect how many bikes are rented?

Considering this data and the insights I could potentially extract from it, I can test for statistical significance by fitting a regression model and examining the p-value it reports. Let’s assume the following hypotheses:

Null Hypothesis — There exists no relationship between the number of bike rentals and temperature/climate.

Alternative Hypothesis — There does exist a relationship between the number of bike rentals and temperature/climate.

To begin my discussion about the p-value, I ran R’s linear model function, lm(), regressing count (the number of rentals) on temp (the temperature) using the data, citibike.
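The regression step can be sketched roughly as follows. The Kaggle file itself is not reproduced here, so this sketch builds a small synthetic stand-in for citibike; the simulated values, the sample size, and the assumed linear trend are illustrative assumptions, and only the names citibike, count, and temp come from the analysis above.

```r
# Synthetic stand-in for the Kaggle data (all values are made up for illustration)
set.seed(42)
n <- 200
temp <- runif(n, min = 0, max = 35)           # assumed temperature range
count <- 50 + 8 * temp + rnorm(n, sd = 40)    # assumed linear trend plus random noise
citibike <- data.frame(temp = temp, count = count)

# Regress the number of rentals (count) on temperature (temp)
model <- lm(count ~ temp, data = citibike)

# Estimated intercept and slope of the fitted line
print(coef(model))
```

With the real data, the data frame would be loaded from the downloaded file instead of simulated, and the lm() call would stay the same.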

I then used the summary() function to procure these statistical measures:

Summary Statistics

The closer the p-value is to zero, the stronger the evidence against the null hypothesis. R’s summary output simplifies the analysis for us by placing asterisks (*) next to each coefficient to mark its level of significance. As the output shows, the number of Citibikes rented has a highly significant relationship with the temperature at the time of rental. Thus, we can reject the Null Hypothesis that there exists no relationship between temperature and the number of bike rentals, and say with confidence that the two are related.
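For readers who want the number itself rather than the printed table, here is a minimal sketch of pulling the p-value out of the model summary and comparing it to a threshold. The data is again a synthetic stand-in, and the 0.05 cutoff is a conventional assumption rather than something taken from the analysis above. The cutpoints passed to symnum() mirror the significance legend that R prints under the coefficient table.

```r
# Synthetic stand-in for the citibike data (values are made up for illustration)
set.seed(1)
citibike <- data.frame(temp = runif(150, min = 0, max = 35))
citibike$count <- 50 + 8 * citibike$temp + rnorm(150, sd = 40)

model <- lm(count ~ temp, data = citibike)

# The coefficient table has columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(model))
p_value <- coef_table["temp", "Pr(>|t|)"]

# Map the p-value to the asterisk codes used in R's summary output
stars <- symnum(p_value, corr = FALSE, na = FALSE,
                cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                symbols = c("***", "**", "*", ".", " "))

alpha <- 0.05  # assumed significance threshold
if (p_value < alpha) {
  cat("Reject the null hypothesis; significance code:", as.character(stars), "\n")
} else {
  cat("Fail to reject the null hypothesis\n")
}
```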

This example shows one advantage of the p-value test: it tells us whether a relationship is or isn’t statistically significant. But we all knew that, right? If it’s cold outside, I’d most likely not want to ride a bike. If the weather is nice, I’d more than likely want to enjoy it with a leisurely ride.

In my opinion, the p-value is advantageous in that it summarizes in a single number the evidence for whether such a relationship exists, but it fails to capture the nuances of that relationship. Therefore, while the p-value is great for establishing that a relationship exists, it does not tell us why the relationship exists, which limits our understanding of the data when making key decisions.
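One way to see this limitation is a sketch in which the p-value is tiny even though temperature explains almost none of the variation in rentals. Everything here is simulated under assumptions chosen for illustration (a very large sample and a deliberately weak effect); it is not a result from the Kaggle data. Looking at the slope, its confidence interval, and R-squared next to the p-value gives some of the nuance the p-value alone leaves out.

```r
# Simulated data with a weak assumed effect and a very large sample
set.seed(7)
n <- 100000
temp <- runif(n, min = 0, max = 35)
count <- 200 + 0.5 * temp + rnorm(n, sd = 100)

model <- lm(count ~ temp)
fit <- summary(model)

p_value   <- coef(fit)["temp", "Pr(>|t|)"]      # evidence against "no relationship"
slope     <- coef(model)[["temp"]]              # estimated change in rentals per degree
ci        <- confint(model, "temp", level = 0.95)  # plausible range for the slope
r_squared <- fit$r.squared                      # share of variation explained by temp

cat("p-value:  ", format(p_value, digits = 3), "\n")
cat("slope:    ", round(slope, 3), "\n")
cat("95% CI:   ", round(ci[1], 3), "to", round(ci[2], 3), "\n")
cat("R-squared:", round(r_squared, 4), "\n")
```

Even these extra numbers describe only the size and direction of the association; they still say nothing about why it exists.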


Artemas Wang

My interests revolve around employing data mining, analytics, and applied machine learning to measure and predict outcomes.