Borui Academy

Chapter 12

Statistics

统计 · mean · variance · frequency distributions

Unit 12 · Statistics 统计

By the end of this chapter you can:

  1. Calculate the mean, median, and mode of a data set and explain when each measure is most appropriate
  2. Compute variance $\sigma^{2}$ and standard deviation $\sigma$ from raw data using the population formula
  3. Construct and read a frequency distribution table and convert it into a frequency density histogram
  4. Identify the effect of outliers on mean vs. median and choose the right central measure for skewed data

Exam weight on past CSCA papers: ~3% (1–2 of 48 MCQs). Quick high-yield wins.


12.1 Descriptive Statistics 描述统计

Statistics begins with a data set: a list of $n$ numerical values $x_1, x_2, \ldots, x_n$. Three numbers summarise where the "centre" is.

Mean 平均数

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The mean is the balance point of the data. Every value contributes equally, so a single extreme value (outlier 异常值) can pull the mean far from the rest of the group.

Median 中位数

First sort the values in ascending order (从小到大排列), then:

| $n$ | Median position |
|---|---|
| odd | the single middle value at position $\dfrac{n+1}{2}$ |
| even | the average of the two middle values at positions $\dfrac{n}{2}$ and $\dfrac{n}{2}+1$ |

The median is resistant to outliers — one extreme value cannot shift it by more than one rank.
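
The position rule can be sketched in a few lines of Python. The helper name `median_by_position` is hypothetical and the sample lists are made up for illustration; the only subtlety is converting the 1-based positions above to Python's 0-based indices.

```python
def median_by_position(values):
    """Median via the position rule: sort, then pick or average the middle."""
    ordered = sorted(values)          # sort ascending first
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # odd n: 1-based position (n + 1) / 2 is 0-based index n // 2
        return ordered[mid]
    # even n: 1-based positions n/2 and n/2 + 1 are 0-based mid - 1 and mid
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_position([9, 3, 7]))      # 7
print(median_by_position([4, 1, 3, 2]))   # 2.5
```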

Mode 众数

The value that appears most frequently. A data set may have one mode, more than one mode (bimodal / multimodal), or no mode if all values appear equally often.
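
The three cases (one mode, several modes, no mode) can be told apart by counting occurrences. This is a minimal sketch with a hypothetical helper `modes`; it follows the convention stated above that a data set has no mode when all values appear equally often, which differs from Python's built-in `statistics.multimode` (that function returns every value in the all-tied case).

```python
from collections import Counter


def modes(values):
    """Return the list of modes, or [] when every value is equally frequent."""
    counts = Counter(values)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []                     # all values tie: no mode by this convention
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([2, 5, 5, 8]))        # [5]
print(modes([1, 1, 2, 2, 3]))     # [1, 2]
print(modes([1, 2, 3]))           # []
```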

Central tendency at a glance

| Measure | Formula / definition | Sensitive to outliers? |
|---|---|---|
| Mean 平均数 | $\bar{x} = \dfrac{\sum x_i}{n}$ | Yes |
| Median 中位数 | middle of sorted list | No |
| Mode 众数 | most frequent value | No |

⚠️ Mean vs. median when outliers appear:
If one student scores 5 on a test where everyone else scores 80–90, the mean drops sharply while the median barely moves. CSCA problems often ask which measure "better represents" the data (更能代表数据). Answer: when data is skewed or has clear outliers, the median is a better representative.
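
A quick numerical check of this effect, using Python's standard `statistics` module. The scores below are invented for illustration only: one list is a tight cluster, the other replaces a single score with 5.

```python
from statistics import mean, median

# Hypothetical scores, invented for illustration only.
typical = [80, 82, 84, 85, 86, 88, 90]
with_outlier = [5, 82, 84, 85, 86, 88, 90]   # the 80 is replaced by 5

print(mean(typical), median(typical))             # mean and median both 85
print(mean(with_outlier), median(with_outlier))   # mean falls to about 74.3, median stays 85
```

The mean falls by more than ten points while the median does not move, which is why the median is preferred for data with a clear outlier.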

🔑 Sort first, always. The exam almost always gives you an unsorted list on purpose. Sort before computing median or identifying mode.

Worked Example 12.1.A

The scores of 7 students are: $72,\ 85,\ 90,\ 60,\ 85,\ 78,\ 5$.

Find the mean, median, and mode. Which measure best represents the class?

Solution.

Step 1. Sort: $5,\ 60,\ 72,\ 78,\ 85,\ 85,\ 90$.

Step 2. Mean:
$$\bar{x} = \frac{5 + 60 + 72 + 78 + 85 + 85 + 90}{7} = \frac{475}{7} \approx 67.9$$

Step 3. Median: $n = 7$ (odd), so the position is $\dfrac{7+1}{2} = 4$. The 4th sorted value is $\mathbf{78}$.

Step 4. Mode: $85$ appears twice; all other values appear once. Mode $= \mathbf{85}$.

Which best represents the class?
The score of $5$ is an outlier (e.g. absent or unwell student). The mean $\approx 67.9$ is dragged well below the main cluster of $60$–$90$. The median $= 78$ better represents a typical student's result.

$$\boxed{\bar{x} \approx 67.9,\qquad \text{median} = 78,\qquad \text{mode} = 85}$$

⚠️ Even $n$ trap: When $n = 6$, for instance, students often pick just one middle value. Always average the values at positions 3 and 4 of the sorted list.


12.2 Variance and Standard Deviation 方差与标准差

The mean tells us where the centre is; variance (方差) measures how far data spreads around that centre.

Population variance 方差

$$\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2}$$

In words: compute each data value's deviation from the mean, square it, then average all the squares.

Standard deviation 标准差

$$\sigma = \sqrt{\sigma^{2}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2}}$$

Standard deviation has the same units as the original data, making it more interpretable than variance.
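
The two definitions translate directly into code. This is a minimal sketch of the population formulas (denominator $n$); the function names and the sample data are hypothetical, chosen only for illustration.

```python
import math


def population_variance(values):
    """Average of squared deviations from the mean (denominator n)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / n


def population_std(values):
    """Square root of the population variance; same units as the data."""
    return math.sqrt(population_variance(values))


data = [1, 3, 5, 7]                   # made-up data set, mean 4
print(population_variance(data))      # 5.0
print(population_std(data))           # square root of 5
```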

Key properties

| Property | Meaning |
|---|---|
| $\sigma^{2} \ge 0$ | Squared terms are never negative |
| $\sigma^{2} = 0$ | All data values are equal to the mean |
| Larger $\sigma^{2}$ | Data is more spread out (more scattered / 离散程度大) |
| Smaller $\sigma^{2}$ | Data is tightly clustered around the mean |

🔑 Shortcut formula (mean of squares minus square of mean):
$$\sigma^{2} = \frac{1}{n}\sum x_i^{2} - \bar{x}^{2} = \overline{x^{2}} - \bar{x}^{2}$$
This avoids computing $(x_i - \bar{x})$ for each value and is faster on the exam.
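
The shortcut is not a separate rule; it follows from the definition. Expanding the square and using $\frac{1}{n}\sum x_i = \bar{x}$ gives the following short derivation:

```latex
\begin{aligned}
\sigma^{2} &= \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2} \\
           &= \frac{1}{n}\sum_{i=1}^{n}\left(x_i^{2} - 2\bar{x}\,x_i + \bar{x}^{2}\right) \\
           &= \frac{1}{n}\sum_{i=1}^{n}x_i^{2} \;-\; 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^{n}x_i \;+\; \bar{x}^{2} \\
           &= \overline{x^{2}} - 2\bar{x}^{2} + \bar{x}^{2} \\
           &= \overline{x^{2}} - \bar{x}^{2}
\end{aligned}
```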

⚠️ Population vs. sample formula:
The CSCA syllabus uses the population formula with denominator $n$. The sample variance (样本方差) uses $n-1$ (Bessel's correction). Unless the problem explicitly says "sample variance," use denominator $n$.
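
The difference is easy to see numerically. Python's standard `statistics` module provides both versions: `pvariance` divides by $n$ and `variance` divides by $n-1$. The data below are made up for illustration.

```python
from statistics import pvariance, variance

data = [1, 3, 5, 7]          # made-up values; squared deviations sum to 20

population = pvariance(data)   # 20 / 4, the formula used in this chapter
sample = variance(data)        # 20 / 3, Bessel's correction

print(population, sample)
```

The sample value is always the larger of the two (for data that are not all equal), so using the wrong denominator changes the answer.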

Worked Example 12.2.A

Five students' scores are: $70,\ 75,\ 80,\ 85,\ 90$. Find $\sigma^{2}$ and $\sigma$.

Solution.

Step 1. Mean:
$$\bar{x} = \frac{70+75+80+85+90}{5} = \frac{400}{5} = 80$$

Step 2. Deviations and squared deviations:

| $x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^{2}$ |
|---|---|---|
| 70 | $-10$ | $100$ |
| 75 | $-5$ | $25$ |
| 80 | $0$ | $0$ |
| 85 | $+5$ | $25$ |
| 90 | $+10$ | $100$ |
| Sum | $0$ | $250$ |

Step 3. Variance:
$$\sigma^{2} = \frac{250}{5} = 50$$

Step 4. Standard deviation:
$$\sigma = \sqrt{50} = 5\sqrt{2} \approx 7.07$$

$$\boxed{\sigma^{2} = 50,\qquad \sigma = 5\sqrt{2}}$$

Verification using the shortcut:
$$\overline{x^{2}} = \frac{70^{2}+75^{2}+80^{2}+85^{2}+90^{2}}{5} = \frac{32250}{5} = 6450$$
$$\sigma^{2} = 6450 - 80^{2} = 6450 - 6400 = 50 \quad\checkmark$$
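
The same check can be run in code. This sketch computes the variance of the five scores both ways, once from the definition and once from the shortcut; the variable names are illustrative.

```python
scores = [70, 75, 80, 85, 90]
n = len(scores)
x_bar = sum(scores) / n

# Definition: average of the squared deviations from the mean.
by_definition = sum((x - x_bar) ** 2 for x in scores) / n

# Shortcut: mean of the squares minus the square of the mean.
by_shortcut = sum(x * x for x in scores) / n - x_bar ** 2

print(by_definition, by_shortcut)   # 50.0 50.0
```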


12.3 Frequency Distributions 频率分布

When a data set is large, we group values into class intervals (组距) and summarise with a frequency distribution table and histogram.

Frequency distribution table 频率分布表

| Class interval (区间) | Frequency $f$ (频数) | Relative frequency $f/n$ (频率) |
|---|---|---|
| $[a_1,\, a_2)$ | $f_1$ | $f_1/n$ |
| $[a_2,\, a_3)$ | $f_2$ | $f_2/n$ |
| $\vdots$ | $\vdots$ | $\vdots$ |

  • Frequency 频数 $f$: raw count of values in the interval.
  • Relative frequency 频率 $f/n$: proportion of all data in that interval; all relative frequencies sum to $1$.
  • Class width 组距 $h$: length of each interval.
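
Building such a table from raw data is a counting exercise. The sketch below assumes equal class widths and half-open intervals $[a_k,\, a_{k+1})$; the helper `frequency_table`, its parameters, and the measurements are hypothetical, introduced only for illustration.

```python
def frequency_table(values, start, width, num_classes):
    """Rows (lower, upper, f, f/n) for half-open classes of equal width."""
    n = len(values)
    counts = [0] * num_classes
    for x in values:
        k = int((x - start) // width)      # index of the class containing x
        if not 0 <= k < num_classes:
            raise ValueError(f"{x} lies outside the class intervals")
        counts[k] += 1
    return [
        (start + k * width, start + (k + 1) * width, f, f / n)
        for k, f in enumerate(counts)
    ]


data = [12, 15, 21, 24, 27, 29, 33, 38]   # made-up measurements
for lo, hi, f, rel in frequency_table(data, start=10, width=10, num_classes=3):
    print(f"[{lo}, {hi})  f={f}  f/n={rel}")
```

A value equal to a boundary such as 20 lands in $[20, 30)$, not $[10, 20)$, which matches the half-open notation in the table.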

Frequency density histogram 频率/组距直方图

On a standard CSCA histogram the vertical axis is frequency density (频率/组距), not raw frequency:

$$\text{Frequency density} = \frac{\text{Relative frequency}}{\text{Class width}} = \frac{f/n}{h}$$

The area of each bar equals the relative frequency of that class:
$$\text{Bar area} = \text{frequency density} \times h = \frac{f/n}{h} \times h = \frac{f}{n}$$

All bar areas together sum to $1$.

🔑 Reading a histogram: The $y$-axis label "频率/组距" is your signal that bar height $\ne$ relative frequency. You must multiply height by class width to get the relative frequency, then multiply by $n$ to recover the count.
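
The two multiplications can be written as a one-line calculation. The helper `count_from_bar` and the numbers passed to it are hypothetical; the result is rounded because a count must be a whole number and floating-point products can be off by a tiny amount.

```python
def count_from_bar(height, class_width, n):
    """Recover the count in a class from its frequency-density bar height."""
    relative_frequency = height * class_width   # bar area = f/n
    return round(relative_frequency * n)        # counts are whole numbers


print(count_from_bar(height=0.03, class_width=10, n=50))   # 15
```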

⚠️ Unequal class widths: If intervals have different widths, bars of equal relative frequency will have different heights. Frequency density corrects for this — wider intervals get shorter bars, keeping areas honest.
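
A small numerical sketch of this point, using made-up grouped data in which the first two classes hold the same number of values but the second class is twice as wide:

```python
# Made-up grouped data with unequal class widths: (lower, upper, frequency)
classes = [(0, 10, 10), (10, 30, 10), (30, 40, 20)]
n = sum(f for _, _, f in classes)

heights = []
total_area = 0.0
for lo, hi, f in classes:
    h = hi - lo
    rel = f / n
    density = rel / h                 # bar height on a frequency density histogram
    heights.append(density)
    total_area += density * h         # bar area equals the relative frequency
    print(f"[{lo}, {hi})  f/n={rel:.2f}  height={density:.4f}")

print(round(total_area, 10))          # areas sum to 1
```

The first two classes share the relative frequency 0.25, yet the wider one is drawn at half the height, so both bars cover the same area.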

Worked Example 12.3.A

A sample of $n = 40$ measurements is grouped as follows. All class widths are $h = 10$.

| Interval | Frequency $f$ |
|---|---|
| $[10,\, 20)$ | 4 |
| $[20,\, 30)$ | 12 |
| $[30,\, 40)$ | 16 |
| $[40,\, 50)$ | 8 |

(a) Complete the relative frequency and frequency density columns.

(b) Which bar is tallest in the histogram?

(c) How many values are at least 30?

Solution.

(a)

| Interval | $f$ | $f/n$ | Freq. density $= (f/n)/10$ |
|---|---|---|---|
| $[10,20)$ | 4 | $0.10$ | $0.010$ |
| $[20,30)$ | 12 | $0.30$ | $0.030$ |
| $[30,40)$ | 16 | $0.40$ | $0.040$ |
| $[40,50)$ | 8 | $0.20$ | $0.020$ |
| Total | 40 | 1.00 | |

(b) The $[30, 40)$ bar has the highest frequency density ($0.040$), so it is the tallest.

(c) Values in $[30,40)$ plus $[40,50)$: $16 + 8 = 24$.

$$\boxed{24 \text{ values are } \ge 30.}$$


Try it! 自测练习

Q1. Find the mean, median, and mode of: $3,\ 7,\ 7,\ 9,\ 4$.

Q2. The data set $\{a,\ 4,\ 6,\ 8,\ 12\}$ has mean $\bar{x} = 7$. Find $a$.

Q3. Calculate the variance of $\{2,\ 4,\ 4,\ 4,\ 6\}$.

Q4. A frequency density histogram has a bar over $[20, 30)$ with height $0.025$. The total sample size is $n = 80$. How many values fall in $[20, 30)$?

Q5. Two classes both have mean score 75. Class A has $\sigma^{2} = 4$ and Class B has $\sigma^{2} = 36$. Which class is more consistent, and why?

Answers & explanations
  1. Sort: $3,\ 4,\ 7,\ 7,\ 9$.

    • Mean: $\dfrac{3+4+7+7+9}{5} = \dfrac{30}{5} = 6$.
    • Median: $n=5$ (odd), position 3 → 7.
    • Mode: 7 (appears twice).

  2. Total sum $= 5 \times 7 = 35$. Then $a + 4 + 6 + 8 + 12 = 35 \Rightarrow a = 35 - 30 = \boxed{5}$.

  3. Mean: $\dfrac{2+4+4+4+6}{5} = 4$. Deviations: $-2, 0, 0, 0, +2$. Squared: $4, 0, 0, 0, 4$.
    $\sigma^{2} = \dfrac{4+0+0+0+4}{5} = \dfrac{8}{5} = \boxed{1.6}$.

  4. Relative frequency $= 0.025 \times 10 = 0.25$. Count $= 0.25 \times 80 = \boxed{20}$ values.

  5. Class A is more consistent. Both classes average 75, but Class A's variance ($\sigma^{2}=4$) is far smaller, meaning scores cluster tightly around the mean; Class B ($\sigma^{2}=36$) is much more spread out.


📌 Chapter summary

| Topic | Key formula / concept | Common trap |
|---|---|---|
| Mean 平均数 | $\bar{x} = \frac{1}{n}\sum x_i$ | Pulled by outliers; prefer median for skewed data |
| Median 中位数 | Middle of sorted list; average two middles when $n$ is even | Forgetting to sort first; wrong middle position for even $n$ |
| Mode 众数 | Most frequent value | May not exist, or may be non-unique |
| Variance 方差 | $\sigma^{2} = \frac{1}{n}\sum(x_i-\bar{x})^2 = \overline{x^2}-\bar{x}^2$ | Using $n-1$ (sample) instead of $n$ (population) |
| Standard deviation 标准差 | $\sigma = \sqrt{\sigma^{2}}$; same units as data | $\sigma^{2}=0$ iff all values equal |
| Relative frequency 频率 | $f/n$; all classes sum to 1 | Confusing frequency count with relative frequency |
| Frequency density 频率/组距 | $(f/n)\div h$; bar area = relative frequency | Reading histogram bar height as frequency or probability directly |

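Connection to earlier units → Variance's shortcut form $\overline{x^2} - \bar{x}^2$ has the same $a^2 - b^2$ shape as the difference-of-squares pattern that appears throughout Unit 1 and Unit 3. It is not factored as $(a-b)(a+b)$ in practice, though: $\overline{x^2}$ is a mean of squares, not the square of a mean, so compute the two terms separately and subtract. Recognising the shortcut lets you skip the deviation table entirely on the exam.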