Probability

During the course of the semester, we will be using probability and statistics in our study of genetics. This first laboratory exercise will review the principles that will be used throughout the course. For some students, this will serve as a refresher, while for others, it may be completely new material.

In the examples given below, assume that a coin has a head and a tail side, each of which is equally likely to be obtained when the coin is tossed. Also, a deck of cards is a standard deck with 52 cards (no jokers), 13 cards in each of four suits. The cards within a suit are ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, jack, queen and king, and the suits are clubs, diamonds, hearts and spades.

The probability (p) of an event occurring is the frequency of the event (e) divided by the total number of possible occurrences (n):

p = e/n

For instance, the probability of selecting the ace of spades is 1/52. The probability of selecting ANY ace is 4/52 or 1/13; and the probability of selecting ANY spade is 13/52 or 1/4.
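
The definition p = e/n can be checked by counting cards directly. The following is a minimal Python sketch, not part of the lab itself; the names `deck` and `probability` are made up for illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["ace", "2", "3", "4", "5", "6", "7", "8", "9", "10",
         "jack", "queen", "king"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]

# A standard deck: every (rank, suit) pair, 52 cards in all.
deck = list(product(RANKS, SUITS))

def probability(event):
    """p = e/n: cards that satisfy the event divided by all cards."""
    e = sum(1 for card in deck if event(card))
    return Fraction(e, len(deck))

p_ace_of_spades = probability(lambda card: card == ("ace", "spades"))
p_any_ace = probability(lambda card: card[0] == "ace")
p_any_spade = probability(lambda card: card[1] == "spades")

print(p_ace_of_spades, p_any_ace, p_any_spade)  # 1/52 1/13 1/4
```

Using `Fraction` keeps the answers as exact fractions, so 4/52 is reduced to 1/13 automatically.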

Sum Rule

The sum rule is used when considering the probability of either of two mutually exclusive events. If the verbal expression is 'A or B,' the 'or' clues you in that the sum rule is applied. In this case, the individual probabilities are added.

p(A or B) = p(A) + p(B)

For example, the probability of selecting the three of clubs or any ace from the deck is the sum of the individual probabilities:

probability of selecting a three of clubs = 1/52
probability of selecting any ace = 4/52
total probability = 1/52 + 4/52 = 5/52.
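
The same answer can be reached two ways: by adding the individual probabilities, or by counting the cards that satisfy either event. The short Python sketch below (an illustration only; the variable names are assumptions) does both and confirms that the two events share no cards.

```python
from fractions import Fraction
from itertools import product

RANKS = ["ace", "2", "3", "4", "5", "6", "7", "8", "9", "10",
         "jack", "queen", "king"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
deck = set(product(RANKS, SUITS))

three_of_clubs = {("3", "clubs")}
aces = {card for card in deck if card[0] == "ace"}

# The sum rule requires mutually exclusive events: no card is in both sets.
assert three_of_clubs.isdisjoint(aces)

# Sum rule: add the individual probabilities.
p_three_of_clubs = Fraction(len(three_of_clubs), len(deck))
p_any_ace = Fraction(len(aces), len(deck))
p_either = p_three_of_clubs + p_any_ace

# Direct count of cards that are the three of clubs OR an ace.
p_counted = Fraction(len(three_of_clubs | aces), len(deck))

print(p_either, p_counted)  # 5/52 5/52
```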

Product Rule

The product rule is used when two events occur simultaneously (or consecutively). The general verbal formula is 'A and B.' In this case, the total probability of both events occurring is the product of the two individual probabilities. There are a few tricky applications which will be considered following the examples.

What is the probability of tossing a coin twice and obtaining heads both times?

The probability of obtaining heads on each coin flip is ½; therefore the total probability is: ½ x ½ = ¼.

Assuming there is an equal chance of having a boy or a girl, what is the probability in a family with 7 children, that all 7 will be girls?

(½)^7 = 1/128
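
Both product rule examples can be reproduced with exact fractions. This is a small Python sketch for checking the arithmetic, with variable names chosen here for illustration.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Two coin tosses, heads both times: 1/2 x 1/2.
p_two_heads = half * half

# Seven children, all girls: 1/2 multiplied by itself seven times.
p_seven_girls = half ** 7

print(p_two_heads, p_seven_girls)  # 1/4 1/128
```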

Now for the tricky calculations:

What is the probability that you will draw an ace of spades and a king of hearts when you draw 2 cards from the deck? The term 'and' indicates that the product rule should be used. As mentioned above, the probability of drawing the ace of spades is 1/52. But the first card could be either the ace of spades OR the king of hearts.  Thus, the probability of drawing one of the two cards first is

1/52 + 1/52 = 1/26

Assuming you are holding the first card when you draw the second card, the probability of drawing the second card specified is 1/51. (Remember that one of the cards has been removed from the deck!) Thus, the total probability is:

1/26 x 1/51 = 1/1326
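
One way to check this result is to list every possible ordered draw of two cards and count the draws that contain exactly the two target cards. The Python sketch below is an illustration under that brute-force approach, not a method used in the lab; the names are assumptions.

```python
from fractions import Fraction
from itertools import permutations, product

RANKS = ["ace", "2", "3", "4", "5", "6", "7", "8", "9", "10",
         "jack", "queen", "king"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(RANKS, SUITS))

target = {("ace", "spades"), ("king", "hearts")}

# Every ordered way to draw two different cards: 52 x 51 = 2652 draws.
draws = list(permutations(deck, 2))

# Favorable draws: the two target cards, in either order.
favorable = sum(1 for draw in draws if set(draw) == target)

p_both = Fraction(favorable, len(draws))
print(favorable, len(draws), p_both)  # 2 2652 1/1326
```

The two favorable draws correspond to the 'or' step in the text (either target card may come first), and the denominator 52 x 51 reflects the card removed before the second draw.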

What is the likelihood, in a family with 7 children, that there will be one boy and six girls? You could use the same formula as above, but you must keep in mind that there are 7 ways to have one boy and six girls. (The boy could be first, second, third, fourth, fifth, sixth or last in birth order.) Because there are 7 mutually exclusive possibilities, the sum rule comes into play (note the 'or' in the listing above). Thus the probability of six girls and one boy in a family with seven children is 7/128.
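
The seven birth orders can be listed by brute force. The sketch below (Python, for illustration only) enumerates all 128 equally likely birth orders for seven children and keeps those with exactly one boy.

```python
from fractions import Fraction
from itertools import product

# Every birth order for 7 children: 2^7 = 128 equally likely sequences.
families = list(product("BG", repeat=7))

# Keep only the sequences with exactly one boy (and so six girls).
one_boy = [family for family in families if family.count("B") == 1]

p_one_boy = Fraction(len(one_boy), len(families))
print(len(one_boy), p_one_boy)  # 7 7/128
```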

Binomial Expansion and Pascal's Triangle

If you consider the likelihood, in a family with seven children, of having 3 girls and 4 boys, the calculations become a little more difficult. Exactly how many different ways can you have 3 girls and 4 boys? Take a couple of minutes (no more!) to list them; attempt to use a method of logic that will prevent you from repeating any of the combinations.

_______ _______ _______ _______ _______ _______

_______ _______ _______ _______ _______ _______

_______ _______ _______ _______ _______ _______

_______ _______ _______ _______ _______ _______

_______ _______ _______ _______ _______ _______

_______ _______ _______ _______ _______ _______

_______ _______ _______ _______ _______ _______

How many could you list? Did you develop a pattern? Could you easily determine combinations for larger families?  Would you like to know an easier way?

Perhaps you will recall binomial expansion from algebra:

(a + b)^0 = 1

(a + b)^1 = a + b

(a + b)^2 = a^2 + 2ab + b^2

(a + b)^3 = a^3 + 3a^2b + 3ab^2 + b^3

(a + b)^4 = a^4 + 4a^3b + 6a^2b^2 + 4ab^3 + b^4

(a + b)^5 = a^5 + 5a^4b + 10a^3b^2 + 10a^2b^3 + 5ab^4 + b^5

(a + b)^6 = a^6 + 6a^5b + 15a^4b^2 + 20a^3b^3 + 15a^2b^4 + 6ab^5 + b^6

(a + b)^7 = a^7 + 7a^6b + 21a^5b^2 + 35a^4b^3 + 35a^3b^4 + 21a^2b^5 + 7ab^6 + b^7

You may have heard of Pascal's triangle, which is made up of the coefficients in front of each term in the expanded form:

                 1
               1   1
             1   2   1
           1   3   3   1
         1   4   6   4   1
       1   5  10  10   5   1
     1   6  15  20  15   6   1
   1   7  21  35  35  21   7   1

If you look at the triangle, the outside numbers are always one. Each of the other values is the sum of the two numbers just to its left and right in the row above. Calculate the values for the next row in Pascal's triangle.

___ ___ ___ ___ ___ ___ ___ ___ ___
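
The rule for building each row from the one above it can be written as a short function. The Python sketch below is illustrative only (the name `pascal_rows` is an assumption). It deliberately stops at the row for (a + b)^7, which is already shown above, so you can still work out the next row yourself.

```python
def pascal_rows(n_rows):
    """Return the first n_rows rows of Pascal's triangle as lists."""
    rows = [[1]]
    for _ in range(n_rows - 1):
        prev = rows[-1]
        # Each inner value is the sum of the two neighbors in the row above.
        inner = [prev[i] + prev[i + 1] for i in range(len(prev) - 1)]
        # The outside numbers are always one.
        rows.append([1] + inner + [1])
    return rows

rows = pascal_rows(8)  # rows for (a + b)^0 through (a + b)^7
for row in rows:
    print(*row)
```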

In fact, each term in the binomial expansion can be calculated using the formula:

[n! / (x!(n-x)!)] p^x q^(n-x)

Where n! = n x (n-1) x (n-2) x .... x 1

The term n! is read 'n factorial.'

Going back to the initial problem (ways of having 3 girls and 4 boys in a family of 7 children), let

n = total number of children
p = probability of having a girl
q = probability of having a boy
x = number of girls
n - x = number of boys

Then the coefficient

n! / (x!(n-x)!)

represents the number of combinations of 3 girls and 4 boys in a family of 7.

If you don't have a calculator that handles factorials, there is a quick way of calculating them. Let's solve for the coefficient with 3 girls and 4 boys:

7! / (3!4!)
= (7 x 6 x 5 x 4!) / ((3 x 2 x 1) x 4!)    (expand the numerator and denominator)
= (7 x 6 x 5) / 6                          (cancel 4! in the numerator and denominator)
= 7 x 5                                    (simplify, cancel 6)
= 35                                       (solve)

Thus, there are 35 combinations of 3 girls and 4 boys in a family of 7 children. (How many were you able to list??)
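
The coefficient and the full binomial term can also be computed directly. The Python sketch below is an illustration, with `binomial_probability` as a made-up helper name; `math.comb` requires Python 3.8 or later. It uses the 3 girls and 4 boys example, with p = q = ½ as assumed earlier in the text.

```python
from fractions import Fraction
from math import comb, factorial

def binomial_probability(n, x, p):
    """[n! / (x!(n-x)!)] p^x q^(n-x), where q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

n, x = 7, 3  # 7 children, 3 of them girls

# The coefficient written out with factorials, as in the formula.
coefficient = factorial(n) // (factorial(x) * factorial(n - x))

# math.comb computes the same coefficient directly.
assert coefficient == comb(n, x)

p_three_girls = binomial_probability(n, x, Fraction(1, 2))
print(coefficient, p_three_girls)  # 35 35/128
```

Of the 128 equally likely birth orders, 35 have exactly 3 girls and 4 boys, so the probability is 35/128.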

The usefulness of this formula will become apparent when we begin analyzing pedigrees. If, for instance, we have a family in which both parents are carriers for the recessive gene for cystic fibrosis (Cc), and we want to know the probability that their first child will have cystic fibrosis (cc), we can readily determine that the probability of each parent providing the defective gene is ½, and since both parents must contribute the gene, we use the product rule to determine the probability is ½ x ½ = ¼. But what is the probability that 2 of their 3 children will have cystic fibrosis? Use the formula to calculate the probability:

n = number of children
p = probability of having cystic fibrosis
q = probability of not having cystic fibrosis
x = number of children with cystic fibrosis
n - x = number of children not having cystic fibrosis

What is the probability that two of the three children WILL have cystic fibrosis? ______

What is the probability that two of the three children WILL NOT have cystic fibrosis? ____

Why is there a difference in these two values?

Chi Square Analysis

If you were to flip a coin 100 times, how many times would you expect to obtain heads? __ Would you be surprised if there were some slight variations from this value? ____

If a magician were flipping the coin and obtained heads 50% of the time, would you be suspicious that something unusual was occurring? ___

What about 75% of the time? ___

If the magician ALWAYS got heads, 100 times in a row, would you be suspicious? ___

We expect that there will be chance deviations from the expected values, but sometimes the variations are large, and are due to something beyond chance. Sample size has an impact; it is common in a family with two children to have all boys. It is not as common in a family of twenty children to have all boys! This is why a larger sample yields results with greater validity.
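
The effect of sample size in this example can be put into numbers with the product rule. The short Python sketch below assumes, as earlier in the text, an equal chance of having a boy or a girl.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Probability that every child is a boy, for two family sizes.
p_all_boys_2 = half ** 2
p_all_boys_20 = half ** 20

print(p_all_boys_2, p_all_boys_20)  # 1/4 1/1048576
```

One family in four with two children is expected to have all boys, but only about one in a million with twenty children.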

The chi square test is used to determine if the deviations are within a range considered to be normal, or if they are so different than what we expected that we must consider that something other than chance is involved.

One method of determining whether variations from the expected are reasonable is the chi square test (see table 1). The observed results are what is measured. The expected results are what was predicted. These should be in the same units. For instance, if you expect ¼ of families with two children to have all boys, and you poll 100 families with two children, you would expect ¼ x 100 or 25 families to have all boys. One way of double checking that the observed and expected values are in the same units is to see if the sums of the two columns are the same. The deviation is the difference between the observed and expected values (O - E). Note that the sum of all the deviations is always zero. This value isn't very useful, since the negative deviations cancel the positive ones. Therefore the squared deviation [(O - E)^2] is used. This eliminates all negative values. This is also not very useful on its own, as it doesn't take into consideration sample size. (A deviation of two is a LOT in a sample size of five, but not very much in a sample size of a thousand.) Therefore, the squared deviation is divided by the expected value to obtain a measure of the relative size of these variations from the predicted values [(O - E)^2/E]. These are summed to obtain the chi square (χ^2) value.

Table 1

category         observed (O)   expected (E)   (O - E)   (O - E)^2   (O - E)^2/E
2 boys                32             25            7         49          1.96
1 boy; 1 girl         46             50           -4         16          0.32
2 girls               22             25           -3          9          0.36
total                100            100            0                 χ^2 = 2.64
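
The columns of table 1 can be reproduced step by step. The following Python sketch is a minimal illustration using the observed and expected values from the two-child family example; the function name `chi_square` is an assumption.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Categories: 2 boys; 1 boy and 1 girl; 2 girls (100 families).
observed = [32, 46, 22]
expected = [25, 50, 25]

# Observed and expected must be in the same units: the totals match.
assert sum(observed) == sum(expected)

deviations = [o - e for o, e in zip(observed, expected)]
chi2 = chi_square(observed, expected)

print(deviations, sum(deviations), round(chi2, 2))  # [7, -4, -3] 0 2.64
```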

The chi square value is then found on a chart to determine a range of p (probability of variation due to chance alone) values (table 2). The degrees of freedom are the minimum number of values in a data set that must be known in order to determine the values of the remaining classes. In the example of families with 2 children, if you were told to collect the data from 100 families, and you knew that 32 families had two boys and 46 families had a boy and a girl (in any order), you could determine that 22 families had two girls. In fact, if you know any two classes, you can determine the third class. The value of the third class depends on the value of the first two, which are independent variables. The degrees of freedom is a measure of the number of independent variables, which in this example is two. For the most part, the degrees of freedom is one less than the number of classes, though when we discuss population genetics, you will discover this general rule does not apply.

A p value of .99 means that, if the hypothesis is correct, chance alone would produce deviations at least this large 99% of the time. A p value of .05 means that chance alone would produce deviations this large only 5% of the time. Typically, a p value greater than 0.05 is accepted as variation due to chance alone. Any time the p value is less than 0.05, it is assumed that something other than chance is involved. Usually the chi square value falls between two values listed in the chart, so the p value is given as a range. In the example given in table 1, the χ^2 value is 2.64 and there are 2 degrees of freedom. Thus, the p value lies between 0.20 and 0.50. It is written as:

0.50 > p > 0.20

This means that, if the hypothesis is correct, chance alone would produce deviations at least this large 20-50% of the time. Because p is greater than 0.05, the hypothesis (that in a family with two children, ¼ will have two boys, ½ will have a boy and a girl, and ¼ will have two girls) is accepted.
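
For the special case of two degrees of freedom, the chi square tail probability has a simple closed form, p = exp(-χ^2/2), so the range read from the chart can be checked directly. This shortcut is not part of the lab and holds only for 2 degrees of freedom; other cases need the chart or a statistics package. A minimal Python sketch:

```python
from math import exp

def p_value_df2(chi2):
    """Chi square tail probability, valid ONLY for 2 degrees of freedom."""
    return exp(-chi2 / 2)

chi2 = 2.64  # value from the two-child family example
p = p_value_df2(chi2)

print(round(p, 3))     # 0.267
print(0.50 > p > 0.20)  # True
```

The computed value falls inside the range 0.50 > p > 0.20, and it is well above 0.05.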

Worksheet Name: _____________________________

Using H for heads and T for tails, list all different possible results of flipping three coins (the order of the coins matters):
_______________________________________________________________________

Now, how many ways are there of getting three heads? ___ two heads and a tail? ___ two tails and a head? ___ three tails? ___ How many different possibilities are there? ___

Flip three coins ten times and record your results:

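trial     1     2     3     4     5     6     7     8     9     10
coin 1   ___   ___   ___   ___   ___   ___   ___   ___   ___   ___
coin 2   ___   ___   ___   ___   ___   ___   ___   ___   ___   ___
coin 3   ___   ___   ___   ___   ___   ___   ___   ___   ___   ___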

Now, tally the data:

             your data    class data
3 H             ___          ___
2 H, 1 T        ___          ___
1 H, 2 T        ___          ___
3 T             ___          ___

Complete a chi square analysis using your data:

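category     observed (O)   expected (E)   (O - E)   (O - E)^2   (O - E)^2/E
3 H              ___            ___          ___        ___          ___
2 H, 1 T         ___            ___          ___        ___          ___
1 H, 2 T         ___            ___          ___        ___          ___
3 T              ___            ___          ___        ___          ___
total            ___            ___          ___       -----      χ^2 = ___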

df = ________ ___ > p > ____ Accept or reject hypothesis?

Complete a chi square analysis using class data:

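category     observed (O)   expected (E)   (O - E)   (O - E)^2   (O - E)^2/E
3 H              ___            ___          ___        ___          ___
2 H, 1 T         ___            ___          ___        ___          ___
1 H, 2 T         ___            ___          ___        ___          ___
3 T              ___            ___          ___        ___          ___
total            ___            ___          ___       -----      χ^2 = ___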

df = ________ ___ > p > ____ Accept or reject hypothesis?

Was there a difference in your p values for your data and the class data? ___ If so, explain why. ____________________________________________________________________

_______________________________________________________________________

In Mendelian genetics, typical expected ratios for a cross involving a single trait are 1:0, 1:1 and 3:1. In a dihybrid cross, the expected ratio is 9:3:3:1, though sometimes modified ratios are obtained because of epistasis. This will be discussed in detail in class, but standard ratios include 15:1, 13:3, 9:7 and 12:3:1. [Note that all of these ratios add up to 16 parts.] We will be using beans of two colors to develop and test hypotheses for best fit. You will be given three bags of beans (A, B and an unknown, which is numbered). You are to sort and count the beans, then develop a hypothesis as to the best Mendelian ratio (or modified ratio) that fits the data. The easiest way to make an educated guess is to determine the fraction of beans that are each of the two colors (white/total and brown/total); then multiply these decimals by 16 (since the ratios all involve 16 parts). Your values will most likely NOT be whole numbers, but, when rounded to whole numbers, should give you a 'best guess' on which to base your hypothesis. Once you have developed a hypothesis as to the appropriate ratio, do a chi square analysis to see if the variation is due to chance alone.
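
The 'multiply by 16' step can be sketched as follows. The bean counts below are made up purely for illustration and are NOT lab data; the function name `parts_of_16` is also an assumption. The sketch converts counts to parts out of 16, rounds them to get a 'best guess' ratio, and then computes the expected counts for that ratio.

```python
def parts_of_16(white, brown):
    """Scale each color's fraction of the total to parts out of 16."""
    total = white + brown
    return 16 * white / total, 16 * brown / total

# Hypothetical counts, for illustration only.
white, brown = 131, 109
total = white + brown

raw_parts = parts_of_16(white, brown)              # about 8.73 and 7.27
guess = tuple(round(part) for part in raw_parts)   # best-guess ratio

# Expected counts under the guessed ratio, for use in the chi square table.
expected = [total * part / 16 for part in guess]

print(guess)     # (9, 7)
print(expected)  # [135.0, 105.0]
```

With these made-up counts the best guess is 9 white : 7 brown. If the rounded parts do not add up to 16, or do not match any of the ratios listed above, that is a sign the guess needs another look before running the chi square analysis.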

bean color    bag A    bag B    unknown # ___
white          ___      ___         ___
brown          ___      ___         ___
total          ___      ___         ___

Bag A: predicted ratio: ____ white: ____ brown

category     observed (O)   expected (E)   (O - E)   (O - E)^2   (O - E)^2/E
white            ___            ___          ___        ___          ___
brown            ___            ___          ___        ___          ___
total            ___            ___          ___       -----      χ^2 = ___

df = ________ ___ > p > ____ Accept or reject hypothesis?

Bag B: predicted ratio: ____ white: ____ brown

category     observed (O)   expected (E)   (O - E)   (O - E)^2   (O - E)^2/E
white            ___            ___          ___        ___          ___
brown            ___            ___          ___        ___          ___
total            ___            ___          ___       -----      χ^2 = ___

df = ________ ___ > p > ____ Accept or reject hypothesis?

Unknown #___: predicted ratio: ____ white: ____ brown

category     observed (O)   expected (E)   (O - E)   (O - E)^2   (O - E)^2/E
white            ___            ___          ___        ___          ___
brown            ___            ___          ___        ___          ___
total            ___            ___          ___       -----      χ^2 = ___

df = ________ ___ > p > ____ Accept or reject hypothesis?

The bags of beans were prepared as follows. To obtain a ratio of X white: Y brown beans, X tablespoons of white beans were mixed with Y tablespoons of brown beans. The assumption was made that the beans were the same size. Based on the results of your studies, was this a reasonable assumption? Why or why not?

In one school system, there are 472 families that have five children. Do a chi square analysis to determine if the variation in the distribution is due to chance alone.

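category           observed (O)   expected (E)   (O - E)   (O - E)^2   (O - E)^2/E
5 girls                 19            ___          ___        ___          ___
4 girls, 1 boy          81            ___          ___        ___          ___
3 girls, 2 boys        162            ___          ___        ___          ___
2 girls, 3 boys        135            ___          ___        ___          ___
1 girl, 4 boys          69            ___          ___        ___          ___
5 boys                   6            ___          ___        ___          ___
total                  472            ___          ___       -----      χ^2 = ___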

df = ________ ___ > p > ____ Accept or reject hypothesis?

What is the total number of girls in all the families? ___
What is the total number of boys in all the families? ___

One of the schools in the district is a girls' boarding school. How has the data from this school affected the results?

What is the probability of picking a red 2 or a black 3 from a deck of cards? Show your work.

What is the probability of picking a six and an eight from the deck of cards? Show your work.

Last Updated: August 24, 2001
Site map: Margaret F. Hicks Home - Biology 2120 - Notes - Probability Lab

Pellissippi State Technical Community College