breastcancer_6

=__Breast Cancer Risk__=


|| //Age (years)// || //Probability of Developing Breast Cancer (%)// ||
|| 20 || 0.04 ||
|| 30 || 0.4 ||
|| 40 || 1.49 ||
|| 50 || 2.54 ||
|| 60 || 3.43 ||

Plotting our data gives the following graph:

As you can see, the points follow a roughly straight trend. Looking at the data, we assumed that a line would fit it best, so we first calculated the equation of a line by hand.

We first picked two points: point 1 at (20, 0.04) and point 2 at (60, 3.43).

By subtracting Y1 from Y2 and placing the result over X2 - X1, we got the slope.

3.43 - 0.04 = 3.39
60 - 20 = 40
slope = 3.39 / 40 = 0.08475



Using the point-slope form with the point (60, 3.43), we could then rewrite the line in the form ax+b:

y = 0.08475(x - 60) + 3.43
y = 0.08475x - 5.085 + 3.43

In the end, we get the line formula in the form ax+b: y = 0.08475x - 1.655
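The hand calculation above can be checked with a few lines of Python. This is only a sketch of the same arithmetic; the variable names are our own and are not part of the original work.

```python
# Two points chosen from the table: (age, probability in %).
x1, y1 = 20, 0.04
x2, y2 = 60, 3.43

# Slope = change in y divided by change in x.
slope = (y2 - y1) / (x2 - x1)

# Point-slope form y = slope * (x - x2) + y2,
# expanded into y = slope * x + intercept.
intercept = y2 - slope * x2

print(f"y = {slope:.5f}x + ({intercept:.3f})")
```

The printed slope and intercept should agree with the values 0.08475 and -1.655 found by hand.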

By entering the data into the calculator and running LinReg, we were able to get another formula for our data: y = 0.0892x - 1.988. This equation is a pretty good fit: the calculator reports r² = 0.980 (the squared correlation coefficient; r itself is about 0.990), which corresponds to a fit of about 98%.
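To see where the calculator's numbers come from, here is a minimal sketch of a least-squares line fit in plain Python. It assumes that LinReg performs an ordinary least-squares fit; the function name `linear_regression` is our own invention for illustration.

```python
ages = [20, 30, 40, 50, 60]
risk = [0.04, 0.4, 1.49, 2.54, 3.43]  # probability in %

def linear_regression(xs, ys):
    """Return (slope, intercept, r_squared) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

slope, intercept, r_squared = linear_regression(ages, risk)
print(f"y = {slope:.4f}x + ({intercept:.3f}), r^2 = {r_squared:.3f}")
```

With the table data, this reproduces the slope 0.0892 and intercept -1.988, and the 0.980 value matches r² in this sketch.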

residual sum (line found by hand) = -0.775

residual sum (LinReg line) = 0

Although the residual sum is 0, the fit is not 100%. The calculator finds the line by balancing how far the data points lie above and below it, so the positive and negative residuals cancel out. Because of a few "outliers", the line takes a path through the middle of the points that are off. This means that not every point lies exactly on the line, which is why the fit of this graph is 98% rather than 100%.

As you can see, neither of the residual graphs shows a clear pattern, so the linear equations we found by hand and with the calculator are both good fits.
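The residual sums for the two lines can be recomputed with a short sketch like the one below. The helper name `residuals` is our own; a residual here is taken to be the observed value minus the value predicted by the line. The sum of squared residuals is an extra measure we add for comparison and was not part of the original calculator work.

```python
ages = [20, 30, 40, 50, 60]
risk = [0.04, 0.4, 1.49, 2.54, 3.43]  # probability in %

def residuals(slope, intercept, xs, ys):
    """Observed value minus the value predicted by y = slope * x + intercept."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

hand = residuals(0.08475, -1.655, ages, risk)   # line found by hand
linreg = residuals(0.0892, -1.988, ages, risk)  # line from LinReg

for name, res in [("by hand", hand), ("LinReg", linreg)]:
    total = sum(res)
    squares = sum(r ** 2 for r in res)
    print(f"{name}: sum = {total:.3f}, sum of squares = {squares:.3f}")
```

The sums should come out close to -0.775 and 0. The sum of squares is smaller for the LinReg line even though its individual residuals are not zero, which illustrates why a residual sum of 0 does not mean a perfect fit.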

We also tried a quadratic function to fit our data. Using QuadReg on our calculator, we came up with the equation:



We entered this quadratic function into the calculator along with the data, and it looked something like this:

residual sum = 0.008

The fit reported for this graph is 96%, meaning the curve accounts for about 96% of the variation in the data.

The residual graph looked somewhat like the one for the line graph: neither shows a pattern. Although the residual sum of the quadratic formula, 0.008, is quite good, it is worse than that of the linear formula from LinReg. Of course, the residual sum is not the only thing that determines a good fit, but comparing the percent fit, 98% was clearly better than 96%. We therefore chose the linear formula over the quadratic and continued to work with the two linear functions we found earlier.

To make sure we had the best formula, we also tried a cubic function to see if it would work with the data. However, the equation the calculator gave us did not represent the data at all. Using CubicReg, we got the equation:



This cubic formula was not the best function for the data: it did not pass through any of the points, or even come close to them. A graph showing the relationship was hard to produce, because the curve was so far from the data that the window needed to show the cubic function made the data points tiny and unclear.

As a result of all this investigation, we came to the conclusion that the linear function was still the best fit for the data, based on the percent fit.

Looking at the two linear functions we found earlier in this investigation, we concluded that the equation obtained with the calculator's "LinReg" fit our data best. The only thing we could really use to judge the two functions was a comparison of their residual plots. By eye, the lines were too close to tell which one fit the data better, but the residual plots made it clear that one set of residuals was closer to the x-axis, meaning that the data was closer to that line. That residual plot belonged to the equation found by the calculator.

By using the model that best fits our data, we can predict the risk of breast cancer as a woman's age increases to 70, 80, 90 and so on. If we wanted to figure out the risk for a woman of 120 (amazingly, a very few women have lived to around 120 years old), we could substitute that age into the equation we found. For example:

y = 0.0892x - 1.988
y = 0.0892(120) - 1.988
y = 10.704 - 1.988
y = 8.716
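The same substitution can be repeated for several ages at once. The sketch below simply evaluates the LinReg equation; the function name `predicted_risk` is our own and is used only for illustration.

```python
def predicted_risk(age):
    """Risk in percent from the LinReg line y = 0.0892x - 1.988."""
    return 0.0892 * age - 1.988

# Ages beyond the original table (20 to 60), so these are extrapolations.
for age in (70, 80, 90, 120):
    print(age, round(predicted_risk(age), 3))
```

For an age of 120 this gives 8.716, matching the substitution worked out by hand.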

Thus, a woman at the age of 120 years would have an 8.716% chance of getting breast cancer. In conclusion, as a woman gets older, the risk of breast cancer clearly increases as well. However, the increase is not very big: over 100 years, a woman's chance of getting breast cancer increases by only a little under 9 percentage points. Thus breast cancer, although deadly, will not be common among women who are not already affected.

As you can see, our data was already close to linear. Although we had one outlier at age 30 (followed by a fast increase in breast cancer risk), our group concluded that the data stayed consistent in the following years, which is why we chose the linear formula to predict the risk for ages after 60.

We also took logarithms of our data to find out whether it was exponential or a power function. We did this by taking the log of only one variable at a time: if taking the log of y alone straightened the data out even more (fixing the outlier), the data would be exponential. We also took the log of both x and y: if that straightened the data out, it would have been a power function.
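One way to compare the transformed data sets numerically is to compute r² of a straight-line fit for each version. The sketch below does this in plain Python; it is an assumption on our part that r² is a fair stand-in for "how straight the data looks", and the helper `r_squared` and the dictionary keys are our own names.

```python
import math

ages = [20, 30, 40, 50, 60]
risk = [0.04, 0.4, 1.49, 2.54, 3.43]  # probability in %

def r_squared(xs, ys):
    """r^2 of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

log_x = [math.log(x) for x in ages]
log_y = [math.log(y) for y in risk]

fits = {
    "linear": r_squared(ages, risk),        # plain (x, y)
    "log_x": r_squared(log_x, risk),        # (log x, y)
    "exponential": r_squared(ages, log_y),  # (x, log y)
    "power": r_squared(log_x, log_y),       # (log x, log y)
}
for name, value in fits.items():
    print(f"{name}: r^2 = {value:.3f}")
```

On the table data, the untransformed version has the highest r² of the four, which is consistent with the transformed graphs not straightening the data any further.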





In conclusion, as the results above show, none of these other options straightened the data any further, which again supports the linear function as the right choice. The equation derived from LinReg in the calculator was y = 0.0892x - 1.988, with a percent fit of 98%, backing up our choice.

None of the three transformed graphs linearizes the data any better, even to the naked eye. The linear model fits the data best because only one point in the data lies noticeably off the line. The transformed graphs, however, do not match this simple linear form. When we attempted to make power or exponential models for this data, they gave a reasonable fit, but the linear one was far superior. We therefore do not need to graph the function on semi-log or log-log paper, because the original relationship is already linear.

Related Links: [|Medical Informations - Breast Cancer]