Frequency Distribution
BY: M.A.SIRAJI
Frequency distribution: A frequency distribution is any arrangement of data that shows the frequency of occurrence of different values of the variable or the frequency of occurrence of values falling within arbitrarily defined ranges of the variable called class intervals.
Rules of thumb for arranging a set of data into class intervals:
 1. Select a class interval of such a size that between 10 and 20 such intervals will cover the total range of the observations.
 2. Select class intervals with a range of 3, 5, 10 or 20 points.
 3. Start each class interval at a value that is a multiple of the size of that interval.
 4. Arrange the class intervals in order of magnitude of the values they include, with the class of largest values on top.
Daily wage of day laborers:
500 50 170 350 200 275 100 80 75 400 450 700 1000 60
325 420 725 640 553 244 290 90 328 115 400 800 60 55
346 220 210 612 673 444 590 100 99 70 44 510 498 517 430
612 110 225 370 450 565 115 100 899 960 30 210 344 287
120 20 30 82 120 227 310 200 100 99 70 60 350 75 97
330 290 177 100 600 610 443 219
Highest wage= 1000
Lowest wage= 20
Difference = 980
Here N= 80
Class size= 50
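Class Interval   Tally                      Frequency (f)
960 – 1009       //                         2
910 – 959                                   0
860 – 909        /                          1
810 – 859                                   0
760 – 809        /                          1
710 – 759        /                          1
660 – 709        //                         2
610 – 659        ////                       4
560 – 609        ///                        3
510 – 559        ///                        3
460 – 509        //                         2
410 – 459        ///// /                    6
360 – 409        ///                        3
310 – 359        ///// ///                  8
260 – 309        ////                       4
210 – 259        ///// //                   7
160 – 209        ////                       4
110 – 159        /////                      5
60 – 109         ///// ///// ///// ///      18
10 – 59          ///// /                    6
                                            N = 80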
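The tallying step can be checked mechanically. The following is a minimal sketch, assuming the 80 wages listed above, a class size of 50 and a lowest class of 10 – 59; the names `class_of`, `frequency` and `BOTTOM_LIMIT` are illustrative and not part of the original notes.

```python
from collections import Counter

# Daily wages of the 80 day laborers listed above.
wages = [
    500, 50, 170, 350, 200, 275, 100, 80, 75, 400, 450, 700, 1000, 60,
    325, 420, 725, 640, 553, 244, 290, 90, 328, 115, 400, 800, 60, 55,
    346, 220, 210, 612, 673, 444, 590, 100, 99, 70, 44, 510, 498, 517, 430,
    612, 110, 225, 370, 450, 565, 115, 100, 899, 960, 30, 210, 344, 287,
    120, 20, 30, 82, 120, 227, 310, 200, 100, 99, 70, 60, 350, 75, 97,
    330, 290, 177, 100, 600, 610, 443, 219,
]

CLASS_SIZE = 50
BOTTOM_LIMIT = 10  # assumed lower apparent limit of the lowest class (10 - 59)


def class_of(value):
    """Return the (lower, upper) apparent limits of the class holding value."""
    steps = (value - BOTTOM_LIMIT) // CLASS_SIZE
    lower = BOTTOM_LIMIT + steps * CLASS_SIZE
    return lower, lower + CLASS_SIZE - 1


# Count how many wages fall into each class interval.
frequency = Counter(class_of(w) for w in wages)

# Print the table with the class of largest values on top.
for lower in range(960, 0, -CLASS_SIZE):
    upper = lower + CLASS_SIZE - 1
    print(f"{lower:>4} - {upper:<4}  {frequency[(lower, upper)]}")
```

A `Counter` returns 0 for classes that contain no observation, so empty intervals still appear in the printed table.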
Apparent and true/ exact limits of class intervals:
The values of a continuous variable fall within certain limits on the measurement scale. These limits are taken as one half unit below to one half unit above the apparent (reported) value. For example, an age value of 19 may be thought of as occupying the range 18.5 to 19.5 along the measurement scale. These limits are referred to as the exact (real or true) limits of a continuous variable.
The class intervals in a frequency distribution are usually reported in apparent limits, which reflect the accuracy of our measurement; the exact limits extend half a unit beyond the apparent limits on each side.
Class Interval   Exact Limit
50 – 99          49.5 – 99.5
Midpoint of a class interval:
45 – 49
40 – 44
35 – 39
30 – 34
25 – 29
20 – 24
15 – 19
To obtain the class midpoint, add half of the class size to the exact lower limit of that class. For the class 30 – 34 (class size 5, exact lower limit 29.5) the midpoint is
29.5 + 2.5 = 32
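The two rules above (exact limits, then midpoint) can be sketched in a few lines. This is only an illustration, assuming measurements are recorded to the nearest whole unit; the function names `exact_limits` and `midpoint` are made up for this example.

```python
def exact_limits(lower, upper, unit=1):
    """Exact limits lie half a measurement unit outside the apparent limits."""
    half = unit / 2
    return lower - half, upper + half


def midpoint(lower, upper, unit=1):
    """Class midpoint = exact lower limit + half of the class size."""
    exact_lower, exact_upper = exact_limits(lower, upper, unit)
    class_size = exact_upper - exact_lower
    return exact_lower + class_size / 2


print(exact_limits(30, 34))  # exact limits of the class 30 - 34
print(midpoint(30, 34))      # midpoint of the class 30 - 34
```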
Age distribution of married women of reproductive age (MWRA)
Class Interval   Frequency (f)   Exact Limit    Class Midpoint
45 – 49          30              44.5 – 49.5    47
40 – 44          55              39.5 – 44.5    42
35 – 39          70              34.5 – 39.5    37
30 – 34          100             29.5 – 34.5    32
25 – 29          150             24.5 – 29.5    27
20 – 24          75              19.5 – 24.5    22
15 – 19          20              14.5 – 19.5    17
                 n = 500
Assumption about the distribution of observations within the class interval:
Age distribution of married women of reproductive age (MWRA)
Table – 1
Class Interval   Frequency (f)
45 – 49          30
40 – 44          55
35 – 39          70
30 – 34          100
25 – 29          150
20 – 24          75
15 – 19          20
Because we lose some information by arranging the data into class intervals, we need to make certain assumptions about the distribution of observations within the class intervals. There are two such assumptions.
Assumption 1: The first assumption states that the observations are uniformly distributed over the exact limits of the class interval.
From Table 1 we can see that the class 25 – 29 contains 150 observations and has a class size of 5.
Here, each unit-wide sub-interval is assumed to hold 150 / 5 = 30 observations:
Class interval   Frequency
28.5 – 29.5      30
27.5 – 28.5      30
26.5 – 27.5      30
25.5 – 26.5      30
24.5 – 25.5      30
                 N = 150
This assumption (#1) is used in the calculation of such statistics as the median, quartiles and percentiles, and in the preparation of the histogram.
Assumption 2: The second assumption states that the observations within a class interval are concentrated at the midpoint of that interval, i.e. every observation within a particular interval is treated as having the mid value of that interval.
Class Interval   Frequency   Midpoint (X)
45 – 49          30          47
40 – 44          55          42
35 – 39          70          37
30 – 34          100         32
25 – 29          150         27
20 – 24          75          22
15 – 19          20          17
                 N = 500
This assumption is used in the calculation of such statistics as the mean and the standard deviation, and in the preparation of the frequency polygon.
Table 2
Class Interval   Frequency   Midpoint (X)   Exact Limit    Cf    Cpf
45 – 49          30          47             44.5 – 49.5    500   100
40 – 44          55          42             39.5 – 44.5    470   94
35 – 39          70          37             34.5 – 39.5    415   83
30 – 34          100         32             29.5 – 34.5    345   69
25 – 29          150         27             24.5 – 29.5    245   49
20 – 24          75          22             19.5 – 24.5    95    19
15 – 19          20          17             14.5 – 19.5    20    4
                 N = 500


$Cpf = \frac{Cf}{N} \times 100$
Cf (cumulative frequency) = absolute number
Cpf (cumulative percentage frequency) = relative number
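The Cf and Cpf columns of Table 2 can be reproduced with a running sum. This is a small sketch using the MWRA frequencies, listed from the lowest class upward; the variable names are illustrative.

```python
from itertools import accumulate

# MWRA age classes from Table 2, lowest class first.
classes = ["15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49"]
freqs = [20, 75, 150, 100, 70, 55, 30]

n = sum(freqs)
cf = list(accumulate(freqs))      # cumulative frequency (absolute number)
cpf = [100 * c / n for c in cf]   # cumulative percentage frequency (relative number)

for label, f, c, p in zip(classes, freqs, cf, cpf):
    print(f"{label:>7}  f={f:<4} Cf={c:<4} Cpf={p:g}")
```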
Graphical representation of data: A graph is a visual portrait of a set of numerical data. By simplifying the data, it helps our understanding and draws our attention to the essential elements of the data set.
Most common forms of graph:
 Histogram
 Frequency polygon
 Cumulative frequency polygon
 Cumulative percentage
 1. Histogram: A histogram is a graph in which frequencies are represented by bars in the form of areas. The width of each bar corresponds to the exact limits of the class interval and the height of each bar corresponds to the class frequency.
[Figure: histogram; horizontal axis: exact class interval]
 2. Frequency polygon: A frequency polygon is drawn in two steps:
a) Put a dot at the intersection of each class midpoint and its class frequency.
b) Join the dots in order of the class intervals and connect the line to the base of the graph.
[Figure: frequency polygon; horizontal axis: class midpoint]
 3. Cumulative frequency polygon: A cumulative frequency polygon differs from a frequency polygon in two respects:
a) We put a dot corresponding to the cumulative frequency instead of the class frequency.
b) We put the dot at the exact upper limit of the class instead of the class midpoint.
[Figure: cumulative frequency polygon; horizontal axis: exact upper limit]
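The plotting rules for the three graphs can be summarised as coordinate lists. The sketch below only builds the points, assuming the MWRA age data and whole-unit measurement; drawing them with a plotting library is left out, and all variable names are illustrative.

```python
# MWRA age classes (apparent limits) and frequencies, lowest class first.
apparent = [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44), (45, 49)]
freqs = [20, 75, 150, 100, 70, 55, 30]

# Histogram: one bar per class, spanning its exact limits, height = frequency.
bars = [(lo - 0.5, hi + 0.5, f) for (lo, hi), f in zip(apparent, freqs)]

# Frequency polygon: a dot at (class midpoint, class frequency).
polygon = [((lo + hi) / 2, f) for (lo, hi), f in zip(apparent, freqs)]

# Cumulative frequency polygon: a dot at (exact upper limit, cumulative frequency).
running = 0
cumulative = []
for (lo, hi), f in zip(apparent, freqs):
    running += f
    cumulative.append((hi + 0.5, running))

print(bars[0], polygon[0], cumulative[0])
```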
How do frequency distributions differ?
Frequency distributions differ from one another in terms of four important properties:
 1. Central location: Central location refers to a value near the centre of the distribution, or the point of greatest concentration of values/observations.
 2. Variation: Variation refers to the extent of clustering of the values in a distribution about the central value.
 3. Skewness: Skewness refers to the symmetry or asymmetry of a distribution. If a distribution is asymmetrical, it can be either positively skewed (skewed to the right) or negatively skewed (skewed to the left).
 4. Kurtosis: Kurtosis refers to the flatness or peakedness of one distribution in relation to another. In terms of kurtosis, there are three types of distribution:
a) Leptokurtic
b) Platykurtic
c) Mesokurtic / normal distribution
Measure of central tendency
Central tendency: Central tendency refers to a tendency in the observations within a distribution to be clustered around a central value of that distribution.
Measures of central tendency: Measures of central tendency are those indexes through which the central tendency of a distribution can be quantified.
The most common measures of central tendency are the mean, median, mode, harmonic mean, quadratic mean and geometric mean.
 1. Arithmetic mean or mean:
Calculation from ungrouped data:
A = 20, B = 50, C = 100, D = 1000, E = 220
$\bar{X} = \frac{\sum X}{N}$
$= \frac{20 + 50 + 100 + 1000 + 220}{5}$
$= 278$
Calculation for grouped data:
Class Interval   Frequency (f)   Midpoint (X)   fX
130 – 134        1               132            132
125 – 129        1               127            127
120 – 124        3               122            366
115 – 119        6               117            702
110 – 114        7               112            784
105 – 109        12              107            1284
100 – 104        16              102            1632
95 – 99          7               97 = A         679
90 – 94          17              92             1564
85 – 89          5               87             435
80 – 84          15              82             1230
75 – 79          6               77             462
70 – 74          3               72             216
65 – 69          1               67             67
                 N = 100                        ΣfX = 9680
So, $\bar{X} = \frac{\sum fX}{N}$
$= \frac{9680}{100} = 96.8$
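The direct method for grouped data multiplies each midpoint by its frequency and divides the total by N. A minimal sketch, with the classes listed from the lowest (65 – 69) to the highest (130 – 134):

```python
# Midpoints and frequencies of the 14 classes, lowest class first.
midpoints = [67, 72, 77, 82, 87, 92, 97, 102, 107, 112, 117, 122, 127, 132]
freqs = [1, 3, 6, 15, 5, 17, 7, 16, 12, 7, 6, 3, 1, 1]

n = sum(freqs)                                            # N
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))     # sum of fX
mean = sum_fx / n

print(n, sum_fx, mean)
```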
Again, $\bar{X} = A + \frac{\sum f x'}{N} \times i$
Where, A = assumed mean, i = class size and
$x' = \frac{X - A}{i}$
Class Interval   f     Midpoint (X)   fX     Cf    x'    fx'
130 – 134        1     132            132    100   7     7
125 – 129        1     127            127    99    6     6
120 – 124        3     122            366    98    5     15
115 – 119        6     117            702    95    4     24
110 – 114        7     112            784    89    3     21
105 – 109        12    107            1284   82    2     24
100 – 104        16    102            1632   70    1     16
95 – 99          7     97 = A         679    54    0     0      (median class)
90 – 94          17    92             1564   47    -1    -17    (modal class)
85 – 89          5     87             435    30    -2    -10
80 – 84          15    82             1230   25    -3    -45
75 – 79          6     77             462    10    -4    -24
70 – 74          3     72             216    4     -5    -15
65 – 69          1     67             67     1     -6    -6
                 N = 100                                 Σfx' = -4
So, $\bar{X} = A + \frac{\sum f x'}{N} \times i$
$= 97 + \frac{-4}{100} \times 5 = 96.8$
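The assumed-mean (step-deviation) method should give the same result as the direct method, which makes it easy to check. A minimal sketch, assuming A = 97 and a class size of 5 as in the table; the variable names are illustrative.

```python
# Midpoints and frequencies of the 14 classes, lowest class first.
midpoints = [67, 72, 77, 82, 87, 92, 97, 102, 107, 112, 117, 122, 127, 132]
freqs = [1, 3, 6, 15, 5, 17, 7, 16, 12, 7, 6, 3, 1, 1]

A = 97  # assumed mean (midpoint of the class 95 - 99)
i = 5   # class size

# Step deviations x' = (X - A) / i; they are whole numbers for these midpoints.
steps = [(x - A) // i for x in midpoints]
sum_f_steps = sum(f * s for f, s in zip(freqs, steps))
n = sum(freqs)

mean_assumed = A + sum_f_steps / n * i
mean_direct = sum(f * x for f, x in zip(freqs, midpoints)) / n

print(sum_f_steps, mean_assumed, mean_direct)
```

Because the step deviations are small integers, this method keeps the hand arithmetic short; the final multiplication by the class size converts the result back to the original scale.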
Median
Median calculation from grouped data:
$Median = l + \frac{\frac{N}{2} - F_b}{f_m} \times i$
Where,
l = exact lower limit of the median class
Fb = sum of the frequencies below the median class
fm = frequency of the median class
i = class size
So, $Median = 94.5 + \frac{50 - 47}{7} \times 5$
= 96.64
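The median formula can be sketched as a short function that walks up the classes until the cumulative frequency reaches N/2. It relies on the assumption that observations are spread uniformly within the median class; the function name `grouped_median` is illustrative.

```python
def grouped_median(exact_lower_limits, freqs, class_size):
    """Median = l + ((N/2 - Fb) / fm) * i, with classes given lowest first."""
    half = sum(freqs) / 2
    below = 0  # Fb: sum of the frequencies below the current class
    for lower, f in zip(exact_lower_limits, freqs):
        if f > 0 and below + f >= half:
            return lower + (half - below) / f * class_size
        below += f
    raise ValueError("empty distribution")


# Exact lower limits 64.5, 69.5, ..., 129.5 and the matching frequencies.
exact_lower = [64.5 + 5 * k for k in range(14)]
freqs = [1, 3, 6, 15, 5, 17, 7, 16, 12, 7, 6, 3, 1, 1]

median = grouped_median(exact_lower, freqs, 5)
print(round(median, 2))
```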
MODE
For raw data, the mode is the most frequently occurring observation. For a frequency distribution, the mode is the midpoint of the class having the highest frequency.
The mode provides a nominal-level measure.
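$Mode = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times i$
Where,
L = exact lower limit of the modal class
$\Delta_1$ = the difference in frequency between the modal class and its immediate lower class
$\Delta_2$ = the difference in frequency between the modal class and its immediate higher class
i = class size
The class with the highest frequency is the modal class.
$\Delta_1 = 17 - 5 = 12$
$\Delta_2 = 17 - 7 = 10$
So, $Mode = 89.5 + \frac{12}{12 + 10} \times 5 = 92.23$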
Again, Mode = 3 Median - 2 Mean
= 3 × 96.64 - 2 × 96.8 = 96.32
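Two of the mode estimates described above are easy to compute: the crude mode (midpoint of the class with the highest frequency) and the empirical estimate from the median and the mean. A minimal sketch, reusing the median and mean already obtained for this distribution:

```python
# Midpoints and frequencies of the 14 classes, lowest class first.
midpoints = [67, 72, 77, 82, 87, 92, 97, 102, 107, 112, 117, 122, 127, 132]
freqs = [1, 3, 6, 15, 5, 17, 7, 16, 12, 7, 6, 3, 1, 1]

# Crude mode: midpoint of the class having the highest frequency.
crude_mode = midpoints[freqs.index(max(freqs))]

# Empirical estimate: Mode = 3 Median - 2 Mean.
median = 96.64
mean = 96.8
empirical_mode = 3 * median - 2 * mean

print(crude_mode, round(empirical_mode, 2))
```

The two estimates need not agree; the empirical relation is only an approximation that works best for moderately skewed distributions.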
Empirical relationship between Mean, Median and Mode
 a) If a distribution is normal, then the mean, median and mode will lie at the same point.
 b) If a distribution is not normal, i.e. it is asymmetrical, then the three will lie at different points, with the mean pulled towards the skewed end.
Properties of Mean, Median and Mode
The mean: it is a measure for interval- and ratio-level variables.
The median: it is a measure for ordinal-level variables.
The mode: it is a measure for nominal-level variables.
Arithmetic mean
Properties:
 1. The sum of the deviations of all the measurements in a distribution from their arithmetic mean is zero, i.e. $\sum (X - \bar{X}) = 0$.
 2. The sum of squares of deviations from the arithmetic mean is less than the sum of squares of deviations from any other value, i.e. $\sum (X - \bar{X})^2 < \sum (X - c)^2$ for any value $c \neq \bar{X}$.
a) The second property indicates that the arithmetic mean is the centre of gravity of a distribution.
b) The second property gives an alternative definition of the arithmetic mean: "the mean is that measure of central location about which the sum of the squares (of deviations) is a minimum" (Artil and Kalton).
c) The mean calculated from a sample of size N is an estimate of the population mean.
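Both properties can be checked numerically. The sketch below uses the five ungrouped values from the arithmetic mean example (20, 50, 100, 1000, 220); the helper name `sum_sq_dev` and the comparison values are illustrative choices.

```python
values = [20, 50, 100, 1000, 220]
mean = sum(values) / len(values)


def sum_sq_dev(center):
    """Sum of squared deviations of the values about the given center."""
    return sum((x - center) ** 2 for x in values)


# Property 1: deviations from the mean sum to zero.
sum_dev = sum(x - mean for x in values)

# Property 2: the sum of squares about the mean is smaller than about other values.
others = [mean - 10, mean + 10, 100, 500]
is_minimum = all(sum_sq_dev(mean) < sum_sq_dev(c) for c in others)

print(mean, sum_dev, is_minimum)
```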
Advantage of mean
 a) The mean is based upon all the observations in a distribution and cannot be calculated if even a single value is missing. Therefore it is the most representative measure of central tendency.
 b) The mean is the measure least affected by sampling fluctuations, so it is the most stable measure of central tendency.
 c) The mean is amenable to algebraic treatment, i.e. the combined mean of two or more distributions can be calculated.
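The formula applied is:
$\bar{X}_c = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2}$
F: $N_1 = 30$, $\bar{X}_1 = 65$
M: $N_2 = 15$, $\bar{X}_2 = 70$
$\bar{X}_c = \frac{30 \times 65 + 15 \times 70}{30 + 15} = \frac{3000}{45} = 66.667$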
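The combined mean is a weighted average of the group means, with the group sizes as weights. A minimal sketch for any number of groups; the function name `combined_mean` is illustrative.

```python
def combined_mean(groups):
    """groups: list of (group size, group mean) pairs."""
    total_n = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total_n


# F: 30 cases with mean 65; M: 15 cases with mean 70.
result = combined_mean([(30, 65), (15, 70)])
print(round(result, 3))
```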
Disadvantage of mean
a) The mean is unduly affected by extremely high or low values. Therefore it becomes a poor measure of central tendency when the distribution is skewed.
b) The mean cannot be calculated when the frequency distribution has an open-ended class at one or both ends.
Median: advantage
a) The median can be calculated even when a distribution has open-ended classes.
b) The median is not affected by extremely high or low values.
c) The median as a measure of central tendency is most needed in markedly skewed distributions.
Median: disadvantage
a) The median is not amenable to algebraic treatment.
b) It is erratic (unpredictable).
Mode: advantage
a) It can be located/identified by mere inspection.
b) It is not necessary to know all items in a distribution to compute mode.
c) The mode is not affected by extremely high or low values.
Mode: disadvantage
a) The mode is ill-defined.
b) It is not representative of the distribution, as it is not based on all the items.
When to use mean, median and mode
The mean:
Use when
a) The measure of central tendency having the greatest stability is wanted. The mean usually varies less from sample to sample drawn from the same population.
b) Other statistics (e.g. measures of variability) are to be calculated. Many statistics are based on the mean.
c) The distribution of observations is symmetrical about the centre.
The median:
Use when
a) The exact midpoint of the distribution is wanted, i.e. we are interested in whether cases fall within the upper or lower half of the distribution and not particularly in how far they are from the central point.
b) The distribution is markedly skewed. Extreme values markedly affect the mean, but not the median.
c) An incomplete distribution is given.
The mode:
Use when
a) A quick and very rough estimate of central value is wanted
b) We wish to know the most typical case of the distribution
VARIABILITY/DISPERSION
Variability: Variability is the degree to which the various observations in a distribution tend to spread about an average value.
Inadequacy of averages: scores of two distributions
1) M: 12, 80, 60, 14, 34; $\bar{X} = 40$
2) F: 36, 43, 42, 41, 38; $\bar{X} = 40$
Both distributions have the same mean, yet the scores in M are far more spread out than those in F.
Measures of dispersion/variability:
 a) Range
 b) Mean deviation
 c) Quartile deviation
 d) Variance and standard deviation
These are absolute measures.
Coefficient of variation (CV): It is a relative measure.
These measures help us to know the compactness or scatteredness of the observations within a distribution.
The range: The range is defined as the difference between the largest and the smallest values. Symbolically, R = L - S.
Quartile deviation: It is defined as the average distance of the quartile points from the median of the distribution, i.e. $QD = \frac{Q_3 - Q_1}{2}$. We get three quartile points. The 1st quartile is the point below which 25% of the observations lie, the 2nd quartile is the point below which 50% of the observations lie, and the 3rd quartile is the point below which 75% of the observations lie.
Class Interval   Frequency (f)   Midpoint (X)   fX     Cf
130 – 134        1               132            132    100
125 – 129        1               127            127    99
120 – 124        3               122            366    98
115 – 119        6               117            702    95
110 – 114        7               112            784    89
105 – 109        12              107            1284   82
100 – 104        16              102            1632   70
95 – 99          7               97             679    54
90 – 94          17              92             1564   47
85 – 89          5               87             435    30
80 – 84          15              82             1230   25
75 – 79          6               77             462    10
70 – 74          3               72             216    4
65 – 69          1               67             67     1
                 N = 100
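The quartile points can be located in the same way as the median, by interpolating within the class where the cumulative frequency reaches the required share of N. The sketch below assumes observations are uniformly spread within each class and takes the quartile deviation as half the distance between the 3rd and 1st quartiles; the function name `point_below` is illustrative.

```python
# Exact lower limits 64.5, 69.5, ..., 129.5 and frequencies, lowest class first.
exact_lower = [64.5 + 5 * k for k in range(14)]
freqs = [1, 3, 6, 15, 5, 17, 7, 16, 12, 7, 6, 3, 1, 1]
CLASS_SIZE = 5


def point_below(fraction):
    """Value below which the given fraction of the observations lie."""
    target = fraction * sum(freqs)
    below = 0  # cumulative frequency below the current class
    for lower, f in zip(exact_lower, freqs):
        if f > 0 and below + f >= target:
            return lower + (target - below) / f * CLASS_SIZE
        below += f
    raise ValueError("fraction must be between 0 and 1")


q1 = point_below(0.25)
q2 = point_below(0.50)  # the median
q3 = point_below(0.75)
quartile_deviation = (q3 - q1) / 2

print(q1, round(q2, 2), round(q3, 2), round(quartile_deviation, 2))
```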
Mean deviation: Mean deviation is the arithmetic mean of the absolute deviations of the scores from the mean of the distribution.
X values         |x| = |X - X̄|
67               15.6
33               18.4
45               6.4
50               1.4
62               10.6
X̄ = 51.4         Σ|x| = 52.4
Calculation of MD:
Raw data:
$MD = \frac{\sum |x|}{N}$; $x = X - \bar{X}$
$= \frac{52.4}{5} = 10.48$
Grouped data: $MD = \frac{\sum f|x|}{N}$
Class Interval   Frequency (f)   Midpoint (X)   fX     Cf    x = X - X̄   f|x|
130 – 134        1               132            132    100   35.2
125 – 129        1               127            127    99    30.2
120 – 124        3               122            366    98
115 – 119        6               117            702    95
110 – 114        7               112            784    89
105 – 109        12              107            1284   82
100 – 104        16              102            1632   70
95 – 99          7               97             679    54
90 – 94          17              92             1564   47
85 – 89          5               87             435    30
80 – 84          15              82             1230   25
75 – 79          6               77             462    10
70 – 74          3               72             216    4
65 – 69          1               67             67     1
                 N = 100
Here,
$\bar{X} = \frac{\sum fX}{N} = \frac{9680}{100} = 96.8$
$MD = \frac{\sum f|x|}{N}$
X = any raw score or class midpoint; x = deviation of X from the mean, $x = X - \bar{X}$
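The raw-data and grouped-data formulas for the mean deviation differ only in the weights, so one function can cover both. A minimal sketch; the function name `mean_deviation` and the optional `freqs` argument are illustrative, and for grouped data the class midpoints stand in for the observations.

```python
def mean_deviation(values, freqs=None):
    """Mean of the absolute deviations from the mean (optionally weighted)."""
    if freqs is None:
        freqs = [1] * len(values)  # raw data: every score counts once
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, values)) / n
    return sum(f * abs(x - mean) for f, x in zip(freqs, values)) / n


# Raw data example from the text.
md_raw = mean_deviation([67, 33, 45, 50, 62])

# Grouped data: class midpoints and frequencies, lowest class first.
midpoints = [67, 72, 77, 82, 87, 92, 97, 102, 107, 112, 117, 122, 127, 132]
freqs = [1, 3, 6, 15, 5, 17, 7, 16, 12, 7, 6, 3, 1, 1]
md_grouped = mean_deviation(midpoints, freqs)

print(round(md_raw, 2), round(md_grouped, 3))
```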
Variance and standard deviation:
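Variance: Variance is the mean of the squared deviations from the mean of the distribution. The standard deviation is the positive square root of the variance.
Formula
Sample: $S^2 = \frac{\sum x^2}{N - 1}$, where $x^2 = (X - \bar{X})^2$, i.e. $S^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$
Population: $\sigma^2 = \frac{\sum x^2}{N}$, where $x^2 = (X - \mu)^2$, i.e. $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$; $\mu$ = population mean
In the sample formula, (N - 1) is the degrees of freedom; dividing by it gives an unbiased estimate of the population variance.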
Height (in feet)
X            x = (X - X̄)    x²
5.5          0.06           0.0036
5.8          0.36           0.1296
5.2          -0.24          0.0576
5.4          -0.04          0.0016
5.3          -0.14          0.0196
X̄ = 5.44                    Σx² = 0.212
$S^2 = \frac{\sum x^2}{N - 1} = \frac{0.212}{5 - 1} = 0.053$
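The height example can be recomputed step by step and cross-checked against Python's standard `statistics` module. This is a sketch only; treating the same five heights as a whole population (the `population_variance` line) is an assumption made purely to contrast the two divisors.

```python
import statistics

heights = [5.5, 5.8, 5.2, 5.4, 5.3]  # heights in feet

n = len(heights)
mean = sum(heights) / n
sum_sq = sum((x - mean) ** 2 for x in heights)  # sum of squared deviations

sample_variance = sum_sq / (n - 1)    # divides by the degrees of freedom
population_variance = sum_sq / n      # only if these were the whole population
sample_sd = sample_variance ** 0.5    # standard deviation

print(round(mean, 2), round(sum_sq, 3), round(sample_variance, 3))
```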
Regression
Regression: Regression refers to the problem of predicting one unknown variable from one known variable or from several known variables.
Regression is of two types:
 a) Simple regression: X → Y
 b) Multiple regression: (X, Y) → Z
Simple regression/simple linear regression: Regression is simple when we can predict one variable from only one other variable.
Suppose X and Y are two variables, where Y is the dependent and X is the independent variable.
The independent variable is regarded as the predictor in regression.
The dependent variable is regarded as the criterion in regression.
Regression equation:
Regression equation of Y on X: $\hat{Y} = a_{yx} + b_{yx} X$
Regression equation of X on Y: $\hat{X} = a_{xy} + b_{xy} Y$
$\hat{Y}$ = predicted score of Y
Coefficient of determination
r = coefficient of correlation
$r^2$ = coefficient of determination
$\sum (Y - \bar{Y})^2$ = total variation (TV)
$\sum (Y - \hat{Y})^2$ = unexplained variation (UV)
$\sum (\hat{Y} - \bar{Y})^2$ = explained variation (EV)
TV = UV + EV
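The split of the total variation into explained and unexplained parts can be verified on a small data set. The sketch below fits the regression of Y on X by least squares; the five paired scores are hypothetical values invented for illustration, and the variable names are not from the original notes.

```python
# Hypothetical paired scores, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

# Regression of Y on X: predicted Y = a_yx + b_yx * X.
b_yx = sxy / sxx
a_yx = mean_y - b_yx * mean_x
predicted = [a_yx + b_yx * x for x in xs]

tv = syy                                                 # total variation
uv = sum((y - p) ** 2 for y, p in zip(ys, predicted))    # unexplained variation
ev = sum((p - mean_y) ** 2 for p in predicted)           # explained variation

r = sxy / (sxx * syy) ** 0.5   # coefficient of correlation
r_squared = ev / tv            # coefficient of determination

print(round(b_yx, 2), round(a_yx, 2), round(r_squared, 2))
```

The identity TV = UV + EV holds for a least-squares line with an intercept, which is why the ratio EV / TV equals the square of r here.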
Properties of $r^2$:
r = 0.5; $r^2$ = 0.25
r = -0.5; $r^2$ = 0.25
 a) The value of $r^2$ is never negative, whatever the sign of r.
 b) The value of $r^2$ ranges from 0 to 1.
 c) $r^2$ is the proportion of the total variation that is explained ($r^2 = EV / TV$), so the explained proportion follows from the magnitude of the correlation coefficient.