Unit 12 · Statistics　统计

By the end of this chapter you can:

Calculate the mean, median, and mode of a data set and explain when each measure is most appropriate

Compute variance $\sigma^{2}$ and standard deviation $\sigma$ from raw data using the population formula

Construct and read a frequency distribution table and convert it into a frequency density histogram

Identify the effect of outliers on mean vs. median and choose the right central measure for skewed data

Exam weight on past CSCA papers: ~3% (1–2 of 48 MCQs). Quick high-yield wins.

12.1　Descriptive Statistics　描述统计

Statistics begins with a data set — a list of $n$ numerical values $x_1, x_2, \ldots, x_n$ . Three numbers summarise where the "centre" is.

Mean　平均数

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$

The mean is the balance point of the data. Every value contributes equally, so a single extreme value (outlier 异常值) can pull the mean far from the rest of the group.

Median　中位数

First sort the values in ascending order (从小到大排列), then:

$n$	Median position
odd	the single middle value at position $\dfrac{n+1}{2}$
even	the average of the two middle values at positions $\dfrac{n}{2}$ and $\dfrac{n}{2}+1$

The median is resistant to outliers: a single extreme value can move it by at most one position in the sorted list.
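The position rule in the table can be sketched in a few lines of Python. This is only an illustration; the function name `median` and the sample lists are made up for the example and are not part of the syllabus.

```python
def median(values):
    """Median via the position rule: sort, then take the middle value(s)."""
    ordered = sorted(values)          # always sort first (ascending)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # odd n: 1-based position (n + 1) / 2 is 0-based index n // 2
        return ordered[mid]
    # even n: average the values at 1-based positions n/2 and n/2 + 1
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([9, 1, 5]))       # 5
print(median([9, 1, 5, 7]))    # 6.0
```

Note that the unsorted input `[9, 1, 5, 7]` would give the wrong answer if the middle entries were read off without sorting.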

Mode　众数

The value that appears most frequently. A data set may have one mode, more than one mode (bimodal / multimodal), or no mode if all values appear equally often.
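The three cases (one mode, several modes, no mode) can be told apart by counting. The sketch below assumes the convention stated above, that a data set in which every value is equally frequent has no mode; the helper name `modes` and the sample lists are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of modes; an empty list means 'no mode'."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    # Convention used here: if there are several distinct values and all
    # of them are equally frequent, the data set has no mode.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([2, 5, 5, 8]))       # [5]      one mode
print(modes([1, 1, 2, 2, 3]))    # [1, 2]   bimodal
print(modes([4, 6, 9]))          # []       no mode
```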

Central tendency at a glance

Measure	Formula / definition	Sensitive to outliers?
Mean 平均数	$\bar{x} = \dfrac{\sum x_i}{n}$	Yes
Median 中位数	middle of sorted list	No
Mode 众数	most frequent value	No

⚠️ Mean vs. median when outliers appear:
If one student scores 5 on a test where everyone else scores 80–90, the mean drops sharply while the median barely moves. CSCA problems often ask which measure "better represents" the data (更能代表数据). Answer: when data is skewed or has clear outliers, the median is a better representative.
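A quick numerical check makes the effect concrete. The scores below are invented purely for illustration (five students between 80 and 90, plus one score of 5); the sketch uses Python's standard `statistics` module.

```python
import statistics

# Hypothetical scores, invented for illustration only.
typical = [80, 82, 85, 88, 90]
with_outlier = typical + [5]

for label, data in [("without outlier", typical), ("with outlier", with_outlier)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{label}: mean = {mean:.1f}, median = {median:.1f}")

# without outlier: mean = 85.0, median = 85.0
# with outlier: mean = 71.7, median = 83.5
```

With these made-up numbers the mean falls by more than 13 points while the median moves by only 1.5.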

🔑 Sort first, always. The exam almost always gives you an unsorted list on purpose. Sort before computing median or identifying mode.

Worked Example 12.1.A

The scores of 7 students are: $72,\ 85,\ 90,\ 60,\ 85,\ 78,\ 5$ .

Find the mean, median, and mode. Which measure best represents the class?

Solution.

Step 1 — Sort: $5,\ 60,\ 72,\ 78,\ 85,\ 85,\ 90$ .

Step 2 — Mean:
$\bar{x} = \frac{5 + 60 + 72 + 78 + 85 + 85 + 90}{7} = \frac{475}{7} \approx 67.9$

Step 3 — Median: $n = 7$ (odd), so position $\dfrac{7+1}{2} = 4$ . The 4th sorted value is $\mathbf{78}$ .

Step 4 — Mode: $85$ appears twice; all other values appear once. Mode $= \mathbf{85}$ .

Which best represents the class?
The score of $5$ is an outlier (for example, a student who was unwell or left early). The mean $\approx 67.9$ is dragged well below the main cluster of $60$ – $90$ . The median $= 78$ better represents a typical student's result.

$\boxed{\bar{x} \approx 67.9,\qquad \text{median} = 78,\qquad \text{mode} = 85}$
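The boxed results can be cross-checked with Python's standard `statistics` module. This is just a verification sketch, not an exam method; the variable names are illustrative.

```python
import statistics

scores = [72, 85, 90, 60, 85, 78, 5]   # unsorted, as given in the question

mean = statistics.mean(scores)         # 475 / 7
median = statistics.median(scores)     # the function sorts internally
mode = statistics.mode(scores)         # most frequent value

print(round(mean, 1), median, mode)    # 67.9 78 85
```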

⚠️ Even $n$ trap: When $n = 6$ , for instance, students often pick just one middle value. The median is the average of the values at sorted positions 3 and 4.

12.2　Variance and Standard Deviation　方差与标准差

The mean tells us where the centre is; variance (方差) measures how far data spreads around that centre.

Population variance　方差

$\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2}$

In words: compute each data value's deviation from the mean, square it, then average all the squares.

Standard deviation　标准差

$\sigma = \sqrt{\sigma^{2}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2}}$

Standard deviation has the same units as the original data, making it more interpretable than variance.
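The two definitions translate almost word for word into code. The following is a minimal sketch of the population formulas (denominator $n$); the function names and the data `[1, 2, 3, 4, 5]` are chosen only for illustration.

```python
import math


def population_variance(values):
    """Average of the squared deviations from the mean (denominator n)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / n


def population_std(values):
    """Square root of the population variance; same units as the data."""
    return math.sqrt(population_variance(values))


data = [1, 2, 3, 4, 5]               # mean 3, squared deviations 4, 1, 0, 1, 4
print(population_variance(data))     # 2.0
print(population_std(data))          # about 1.414, i.e. sqrt(2)
```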

Key properties

Property	Meaning
$\sigma^{2} \ge 0$	Squared terms are never negative
$\sigma^{2} = 0$	All data values are equal to the mean
Larger $\sigma^{2}$	Data is more spread out (more scattered / 离散程度大)
Smaller $\sigma^{2}$	Data is tightly clustered around the mean

🔑 Shortcut formula — mean of squares minus square of mean:
$\sigma^{2} = \frac{1}{n}\sum x_i^{2} - \bar{x}^{2} = \overline{x^{2}} - \bar{x}^{2}$
This avoids computing $(x_i - \bar{x})$ for each value and is faster on the exam.
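For readers who want to see why the shortcut holds, here is a short derivation. It expands the square in the definition and uses only $\bar{x} = \frac{1}{n}\sum x_i$ :

$\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i^{2} - 2\bar{x}\,x_i + \bar{x}^{2}\right)$
$= \frac{1}{n}\sum_{i=1}^{n} x_i^{2} - 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^{n} x_i + \frac{1}{n}\cdot n\,\bar{x}^{2}$
$= \overline{x^{2}} - 2\bar{x}^{2} + \bar{x}^{2} = \overline{x^{2}} - \bar{x}^{2}$

The middle term collapses because $\frac{1}{n}\sum x_i$ is itself $\bar{x}$ .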

⚠️ Population vs. sample formula:
The CSCA syllabus uses the population formula with denominator $n$ . The sample variance (样本方差) uses $n-1$ (Bessel's correction). Unless the problem explicitly says "sample variance," use denominator $n$ .
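The difference between the two denominators is easy to see numerically. Python's standard `statistics` module provides both: `pvariance` divides by $n$ and `variance` divides by $n-1$ . The data set below is invented for illustration.

```python
import statistics

data = [1, 3, 5, 7, 9]   # illustrative values; mean 5, squared deviations sum to 40

pop_var = statistics.pvariance(data)      # population formula: 40 / 5
sample_var = statistics.variance(data)    # sample formula:     40 / 4

print(f"population variance = {float(pop_var):.1f}")   # population variance = 8.0
print(f"sample variance     = {float(sample_var):.1f}")  # sample variance     = 10.0
```

The sample variance is always the larger of the two for the same data (unless both are zero), so mixing up the formulas changes the answer.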

Worked Example 12.2.A

Five students' scores are: $70,\ 75,\ 80,\ 85,\ 90$ . Find $\sigma^{2}$ and $\sigma$ .

Solution.

Step 1 — Mean:
$\bar{x} = \frac{70+75+80+85+90}{5} = \frac{400}{5} = 80$

Step 2 — Deviations and squared deviations:

$x_i$	$x_i - \bar{x}$	$(x_i - \bar{x})^{2}$
70	$-10$	$100$
75	$-5$	$25$
80	$0$	$0$
85	$+5$	$25$
90	$+10$	$100$
Sum	$0$	$250$

Step 3 — Variance:
$\sigma^{2} = \frac{250}{5} = 50$

Step 4 — Standard deviation:
$\sigma = \sqrt{50} = 5\sqrt{2} \approx 7.07$

$\boxed{\sigma^{2} = 50,\qquad \sigma = 5\sqrt{2}}$

Verification using the shortcut:
$\overline{x^{2}} = \frac{70^{2}+75^{2}+80^{2}+85^{2}+90^{2}}{5} = \frac{32250}{5} = 6450$
$\sigma^{2} = 6450 - 80^{2} = 6450 - 6400 = 50 \checkmark$
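The same check can be run in code, computing the variance both from the definition and from the shortcut. This is a sketch for verification only; the variable names are illustrative.

```python
scores = [70, 75, 80, 85, 90]
n = len(scores)

mean = sum(scores) / n                              # 80.0
mean_of_squares = sum(x * x for x in scores) / n    # 32250 / 5 = 6450.0

var_shortcut = mean_of_squares - mean ** 2          # 6450 - 6400
var_definition = sum((x - mean) ** 2 for x in scores) / n

print(var_shortcut, var_definition)                 # 50.0 50.0
```

With non-integer data the two results can differ in the last few decimal places because of floating-point rounding, so compare them with a small tolerance rather than with exact equality.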

12.3　Frequency Distributions　频率分布

When a data set is large, we group values into class intervals (区间) and summarise them with a frequency distribution table and a histogram.

Frequency distribution table　频率分布表

Class interval (区间)	Frequency $f$ (频数)	Relative frequency 频率 $f/n$
$[a_1,\, a_2)$	$f_1$	$f_1/n$
$[a_2,\, a_3)$	$f_2$	$f_2/n$
$\vdots$	$\vdots$	$\vdots$

Frequency 频数 $f$ : raw count of values in the interval.
Relative frequency 频率 $f/n$ : proportion of all data in that interval; all relative frequencies sum to $1$ .
Class width 组距 $h$ : length of each interval.

Frequency density histogram　频率/组距直方图

On a standard CSCA histogram the vertical axis is frequency density (频率/组距), not raw frequency:

$\text{Frequency density} = \frac{\text{Relative frequency}}{\text{Class width}} = \frac{f/n}{h}$

The area of each bar equals the relative frequency of that class:
$\text{Bar area} = \text{frequency density} \times h = \frac{f/n}{h} \times h = \frac{f}{n}$

All bar areas together sum to $1$ .

🔑 Reading a histogram: The $y$ -axis label "频率/组距" is your signal that bar height $\ne$ relative frequency. You must multiply height by class width to get the relative frequency, then multiply by $n$ to recover the count.
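The two multiplications can be written as a tiny helper. This is a sketch under stated assumptions: the function name `count_from_bar` is hypothetical, the bar (height 0.040, width 10, $n = 40$ ) is an example value, and the height is passed as a string so that `Fraction` keeps it exact.

```python
from fractions import Fraction


def count_from_bar(height, class_width, n):
    """Recover the count in a class from a frequency-density bar.

    height is given as a decimal string (e.g. "0.040") so that the
    arithmetic is exact rather than floating-point.
    """
    relative_frequency = Fraction(height) * class_width   # bar area
    return relative_frequency * n                         # count in the class


print(count_from_bar("0.040", 10, 40))   # 16
```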

⚠️ Unequal class widths: If intervals have different widths, bars of equal relative frequency will have different heights. Frequency density corrects for this — wider intervals get shorter bars, keeping areas honest.
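A small example shows the correction at work. The grouped data below is invented: four classes with the same frequency but widths 10, 10, 20 and 40. The sketch computes each frequency density exactly with `Fraction` and confirms that the bar areas still sum to 1.

```python
from fractions import Fraction

# Hypothetical grouped data: (lower bound, upper bound, frequency).
classes = [(0, 10, 5), (10, 20, 5), (20, 40, 5), (40, 80, 5)]
n = sum(f for _, _, f in classes)          # 20

densities = []
total_area = Fraction(0)
for lower, upper, f in classes:
    width = upper - lower
    rel_freq = Fraction(f, n)              # f / n
    density = rel_freq / width             # (f / n) / h
    densities.append(density)
    total_area += density * width          # bar area = relative frequency
    print(f"[{lower}, {upper})  width = {width}  f/n = {rel_freq}  density = {density}")

print("total area =", total_area)          # total area = 1
```

Every class here holds a quarter of the data, yet the densities are 1/40, 1/40, 1/80 and 1/160: the wider the class, the shorter the bar.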

Worked Example 12.3.A

A sample of $n = 40$ measurements is grouped as follows. All class widths are $h = 10$ .

Interval	Frequency $f$
$[10,\, 20)$	4
$[20,\, 30)$	12
$[30,\, 40)$	16
$[40,\, 50)$	8

(a) Complete the relative frequency and frequency density columns.

(b) Which bar is tallest in the histogram?

(c) How many values are at least $30$ ?

Solution.

(a)

Interval	$f$	$f/n$	Freq. density $= (f/n)/10$
$[10,20)$	4	$0.10$	$0.010$
$[20,30)$	12	$0.30$	$0.030$
$[30,40)$	16	$0.40$	$0.040$
$[40,50)$	8	$0.20$	$0.020$
Total	40	1.00	—

(b) The $[30, 40)$ bar has the highest frequency density ( $0.040$ ), so it is the tallest.

(c) Values in $[30,40)$ plus $[40,50)$ : $16 + 8 = 24$ .

$\boxed{24 \text{ values are } \ge 30.}$
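The whole table can be reproduced programmatically as a check. This is a sketch for verification, not the expected exam working; the variable names and the row layout are illustrative choices.

```python
from fractions import Fraction

h = 10                                                  # common class width
table = {(10, 20): 4, (20, 30): 12, (30, 40): 16, (40, 50): 8}
n = sum(table.values())                                 # 40

rows = []
for (lower, upper), f in table.items():
    rel = Fraction(f, n)                                # relative frequency f / n
    rows.append((lower, upper, f, float(rel), float(rel / h)))

for lower, upper, f, rel, density in rows:
    print(f"[{lower},{upper})  f = {f:>2}  f/n = {rel:.2f}  density = {density:.3f}")

tallest = max(rows, key=lambda r: r[4])                 # highest frequency density
at_least_30 = sum(f for lower, _, f, _, _ in rows if lower >= 30)

print("tallest bar:", tallest[:2])                      # tallest bar: (30, 40)
print("values >= 30:", at_least_30)                     # values >= 30: 24
```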

Try it!　自测练习

Q1. Find the mean, median, and mode of: $3,\ 7,\ 7,\ 9,\ 4$ .

Q2. The data set $\{a,\ 4,\ 6,\ 8,\ 12\}$ has mean $\bar{x} = 7$ . Find $a$ .

Q3. Calculate the variance of $\{2,\ 4,\ 4,\ 4,\ 6\}$ .

Q4. A frequency density histogram has a bar over $[20, 30)$ with height $0.025$ . The total sample size is $n = 80$ . How many values fall in $[20, 30)$ ?

Q5. Two classes both have mean score 75. Class A has $\sigma^{2} = 4$ and Class B has $\sigma^{2} = 36$ . Which class is more consistent, and why?

Answers & explanations

Q1. Sort: $3,\ 4,\ 7,\ 7,\ 9$ .
- Mean: $\dfrac{3+4+7+7+9}{5} = \dfrac{30}{5} = 6$ .
- Median: $n=5$ (odd), position 3 → 7.
- Mode: 7 (appears twice).

Q2. Total sum $= 5 \times 7 = 35$ . Then $a + 4 + 6 + 8 + 12 = 35 \Rightarrow a = 35 - 30 = \boxed{5}$ .

Q3. Mean: $\dfrac{2+4+4+4+6}{5} = 4$ . Deviations: $-2, 0, 0, 0, +2$ . Squared: $4, 0, 0, 0, 4$ .
$\sigma^{2} = \dfrac{4+0+0+0+4}{5} = \dfrac{8}{5} = \boxed{1.6}$ .

Q4. The class width is $30 - 20 = 10$ . Relative frequency $= 0.025 \times 10 = 0.25$ . Count $= 0.25 \times 80 = \boxed{20}$ values.

Q5. Class A is more consistent. Both classes average 75, but Class A's variance ( $\sigma^{2}=4$ ) is far smaller, meaning scores cluster tightly around the mean; Class B ( $\sigma^{2}=36$ ) is much more spread out.

📌 Chapter summary

Topic	Key formula / concept	Common trap
Mean 平均数	$\bar{x} = \frac{1}{n}\sum x_i$	Pulled by outliers; prefer median for skewed data
Median 中位数	Middle of sorted list; average two middles when $n$ is even	Forgetting to sort first; wrong middle position for even $n$
Mode 众数	Most frequent value	May not exist, or may be non-unique
Variance 方差	$\sigma^{2} = \frac{1}{n}\sum(x_i-\bar{x})^2 = \overline{x^2}-\bar{x}^2$	Using $n-1$ (sample) instead of $n$ (population)
Standard deviation 标准差	$\sigma = \sqrt{\sigma^{2}}$ ; same units as data	$\sigma^{2}=0$ iff all values equal
Relative frequency 频率	$f/n$ ; all classes sum to 1	Confusing frequency count with relative frequency
Frequency density 频率/组距	$(f/n)\div h$ ; bar area = relative frequency	Reading histogram bar height as frequency or probability directly

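Connection to earlier units → Variance's shortcut form $\overline{x^2} - \bar{x}^2$ comes from expanding each squared deviation with the perfect-square identity $(a-b)^2 = a^2 - 2ab + b^2$ , the same kind of algebraic expansion practised in Unit 1 and Unit 3. It resembles a difference of squares but does not factor as one, because $\overline{x^2}$ is a mean of squares rather than the square of a single quantity. Recognising the shortcut lets you skip the deviation table entirely on the exam.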
