Unit 12 · Statistics 统计
By the end of this chapter you can:
- Calculate the mean, median, and mode of a data set and explain when each measure is most appropriate
- Compute variance and standard deviation from raw data using the population formula
- Construct and read a frequency distribution table and convert it into a frequency density histogram
- Identify the effect of outliers on mean vs. median and choose the right central measure for skewed data
Exam weight on past CSCA papers: ~3% (1–2 of 48 MCQs). Quick high-yield wins.
12.1 Descriptive Statistics 描述统计
Statistics begins with a data set: a list of numerical values $x_1, x_2, \ldots, x_n$. Three numbers summarise where the "centre" is.
Mean 平均数
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The mean is the balance point of the data. Every value contributes equally, so a single extreme value (outlier 异常值) can pull the mean far from the rest of the group.
Median 中位数
First sort the values in ascending order (从小到大排列), then:
| $n$ (number of values) | Median |
|---|---|
| odd | the single middle value at position $\frac{n+1}{2}$ |
| even | the average of the two middle values at positions $\frac{n}{2}$ and $\frac{n}{2}+1$ |
The median is resistant to outliers — one extreme value cannot shift it by more than one rank.
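The position rule can be sketched in Python as a quick self-check. The function name `median_by_position` and both sample lists are illustrative assumptions, not syllabus material.

```python
def median_by_position(values):
    """Return the median using the sorted-position rule."""
    ordered = sorted(values)  # always sort first
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        # odd n: 1-based position (n + 1) / 2 is 0-based index n // 2
        return ordered[mid]
    # even n: average the values at 1-based positions n/2 and n/2 + 1
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical unsorted lists
print(median_by_position([9, 3, 7, 1, 5]))      # 5
print(median_by_position([9, 3, 7, 1, 5, 11]))  # 6.0
```

Sorting inside the function means the caller cannot forget that step.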
Mode 众数
The value that appears most frequently. A data set may have one mode, more than one mode (bimodal / multimodal), or no mode if all values appear equally often.
Central tendency at a glance
| Measure | Formula / definition | Sensitive to outliers? |
|---|---|---|
| Mean 平均数 | $\bar{x} = \frac{1}{n}\sum x_i$ | Yes |
| Median 中位数 | middle of sorted list | No |
| Mode 众数 | most frequent value | No |
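For checking hand calculations, Python's standard `statistics` module computes all three measures. The list `scores` below is made-up data chosen so that two values tie for most frequent.

```python
import statistics

scores = [4, 8, 6, 8, 10, 6, 7]  # hypothetical data, unsorted on purpose

mean_value = statistics.mean(scores)      # sum of values divided by count
median_value = statistics.median(scores)  # sorts internally, then takes the middle
modes = statistics.multimode(scores)      # list of every most-frequent value

print(mean_value, median_value, sorted(modes))
```

`multimode` returns a list, so a bimodal data set such as this one reports both modes rather than only the first.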
⚠️ Mean vs. median when outliers appear:
If one student scores 5 on a test where everyone else scores 80–90, the mean drops sharply while the median barely moves. CSCA problems often ask which measure "better represents" the data (更能代表数据). Answer: when data is skewed or has clear outliers, the median is a better representative.
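A small numeric sketch of this effect, using invented scores in the 80 to 90 range plus one score of 5:

```python
import statistics

# Hypothetical class: everyone scores 80-90 except one student with 5.
typical = [80, 82, 85, 86, 88, 90]
with_outlier = typical + [5]

# How far each measure moves when the outlier is added
mean_shift = statistics.mean(typical) - statistics.mean(with_outlier)
median_shift = statistics.median(typical) - statistics.median(with_outlier)

print(round(mean_shift, 2), median_shift)
```

With these numbers the mean falls by more than 11 points while the median moves by only half a point.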
🔑 Sort first, always. The exam almost always gives you an unsorted list on purpose. Sort before computing median or identifying mode.
Worked Example 12.1.A
The scores of 7 students are: .
Find the mean, median, and mode. Which measure best represents the class?
Solution.
Step 1 — Sort: .
Step 2 — Mean:
Step 3 — Median: (odd), so position . The 4th sorted value is .
Step 4 — Mode: appears twice; all other values appear once. Mode .
Which best represents the class?
The score of is an outlier (e.g. absent or unwell student). The mean is dragged well below the main cluster of –. The median better represents a typical student's result.
⚠️ Even trap: When $n = 6$, for instance, students often pick just one middle value. Always average positions 3 and 4.
12.2 Variance and Standard Deviation 方差与标准差
The mean tells us where the centre is; variance (方差) measures how far data spreads around that centre.
Population variance 方差
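$$s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
where $\bar{x}$ is the mean of the $n$ values. In words: compute each data value's deviation from the mean, square it, then average all the squares.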
Standard deviation 标准差
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Standard deviation has the same units as the original data, making it more interpretable than variance.
Key properties
| Property of the variance $s^2$ | Meaning |
|---|---|
| $s^2 \ge 0$ | Squared terms are never negative |
| $s^2 = 0$ | All data values are equal to the mean |
| Larger $s^2$ | Data is more spread out (more scattered / 离散程度大) |
| Smaller $s^2$ | Data is tightly clustered around the mean |
🔑 Shortcut formula (mean of squares minus square of mean):
$$s^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$$
This avoids computing $x_i - \bar{x}$ for each value and is faster on the exam.
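For readers who want to see why "mean of squares minus square of mean" works, here is a short derivation. It writes $s^2$ for the population variance and $\bar{x}$ for the mean (the symbols are a notational assumption), expands the square, and uses $\frac{1}{n}\sum x_i = \bar{x}$.

```latex
\begin{aligned}
s^2 &= \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 \\
    &= \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^{n}x_i \;+\; \bar{x}^2 \\
    &= \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; 2\bar{x}^2 \;+\; \bar{x}^2 \\
    &= \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; \bar{x}^2
\end{aligned}
```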
⚠️ Population vs. sample formula:
The CSCA syllabus uses the population formula with denominator $n$. The sample variance (样本方差) uses $n - 1$ (Bessel's correction). Unless the problem explicitly says "sample variance," use denominator $n$.
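The two denominators map onto two different functions in Python's standard `statistics` module, which is a common source of mismatched answers when checking work by computer. The data below are hypothetical.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5

population_var = statistics.pvariance(data)  # divides by n
sample_var = statistics.variance(data)       # divides by n - 1
population_sd = statistics.pstdev(data)      # square root of pvariance

print(population_var, round(sample_var, 3), population_sd)
```

For exam-style checks, `pvariance` and `pstdev` are the ones that match the denominator-$n$ convention described here.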
Worked Example 12.2.A
Five students' scores are: $70, 75, 80, 85, 90$. Find the variance $s^2$ and the standard deviation $s$.
Solution.
Step 1 — Mean:
$$\bar{x} = \frac{70 + 75 + 80 + 85 + 90}{5} = \frac{400}{5} = 80$$
Step 2 — Deviations and squared deviations:
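| $x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$ |
|---|---|---|
| 70 | $-10$ | 100 |
| 75 | $-5$ | 25 |
| 80 | 0 | 0 |
| 85 | 5 | 25 |
| 90 | 10 | 100 |
| Sum | 0 | 250 |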
Step 3 — Variance:
$$s^2 = \frac{250}{5} = 50$$
Step 4 — Standard deviation:
$$s = \sqrt{50} = 5\sqrt{2} \approx 7.07$$
Verification using the shortcut: $\frac{70^2 + 75^2 + 80^2 + 85^2 + 90^2}{5} - 80^2 = \frac{32250}{5} - 6400 = 6450 - 6400 = 50$.
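The same example can be checked in a few lines of Python, computing the variance both from the definition and from the shortcut. Variable names are illustrative.

```python
import math

scores = [70, 75, 80, 85, 90]  # data from Worked Example 12.2.A
n = len(scores)
mean = sum(scores) / n

# Definition: average of squared deviations (denominator n)
variance_definition = sum((x - mean) ** 2 for x in scores) / n

# Shortcut: mean of squares minus square of mean
variance_shortcut = sum(x * x for x in scores) / n - mean ** 2

std_dev = math.sqrt(variance_definition)
print(mean, variance_definition, variance_shortcut, round(std_dev, 2))
```

Both routes give a variance of 50, and the standard deviation rounds to 7.07.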
12.3 Frequency Distributions 频率分布
When a data set is large, we group values into class intervals (组距) and summarise with a frequency distribution table and histogram.
Frequency distribution table 频率分布表
| Class interval (区间) | Frequency (频数) | Relative frequency 频率 |
|---|---|---|
- Frequency 频数 : raw count of values in the interval.
- Relative frequency 频率 : proportion of all data in that interval (frequency divided by the total number of values $n$); all relative frequencies sum to $1$.
- Class width 组距 : length of each interval.
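As a rough sketch, a frequency distribution table with equal class widths can be built as follows. The function name `frequency_table`, the sample data, and the choice of left-closed intervals $[a, b)$ are all assumptions made for illustration.

```python
from collections import Counter


def frequency_table(values, start, width, num_classes):
    """Return rows (lower, upper, frequency, relative frequency).

    Classes are assumed left-closed: [lower, upper).
    Values outside the classes are not counted in any row.
    """
    n = len(values)
    # Class index of each value: 0 for the first interval, 1 for the next, ...
    counts = Counter((v - start) // width for v in values)
    rows = []
    for k in range(num_classes):
        lower = start + k * width
        freq = counts.get(k, 0)
        rows.append((lower, lower + width, freq, freq / n))
    return rows


data = [12, 25, 31, 18, 22, 37, 29, 14, 33, 26]  # hypothetical measurements
rows = frequency_table(data, start=10, width=10, num_classes=3)
for row in rows:
    print(row)
```

When every value falls inside one of the classes, the last column sums to 1, which is a quick sanity check on any table.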
Frequency density histogram 频率/组距直方图
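On a standard CSCA histogram the vertical axis is frequency density (频率/组距), not raw frequency:
$$\text{frequency density} = \frac{\text{relative frequency}}{\text{class width}}$$
The area of each bar equals the relative frequency of that class:
$$\text{bar area} = \text{frequency density} \times \text{class width} = \text{relative frequency}$$
All bar areas together sum to $1$.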
🔑 Reading a histogram: The $y$-axis label "频率/组距" is your signal that bar height $\neq$ relative frequency. You must multiply height by class width to get the relative frequency, then multiply by the sample size $n$ to recover the count.
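The two-step reading (height to relative frequency, then to count) can be sketched like this. The function name and the numbers in the call are hypothetical.

```python
def count_from_bar(height, class_width, sample_size):
    """Recover the number of values in a class from its histogram bar."""
    relative_frequency = height * class_width  # bar area
    # Counts are whole numbers, so round away floating-point noise.
    return round(relative_frequency * sample_size)


# Hypothetical bar: height 0.03 over a class of width 10, with 200 values in total
bar_count = count_from_bar(height=0.03, class_width=10, sample_size=200)
print(bar_count)  # 60
```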
⚠️ Unequal class widths: If intervals have different widths, bars of equal relative frequency will have different heights. Frequency density corrects for this — wider intervals get shorter bars, keeping areas honest.
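A minimal sketch of this correction, assuming two invented classes that hold the same number of values but differ in width:

```python
def frequency_densities(classes, n):
    """classes: list of (lower, upper, frequency); returns one density per class."""
    densities = []
    for lower, upper, freq in classes:
        width = upper - lower
        relative_frequency = freq / n
        densities.append(relative_frequency / width)
    return densities


# Hypothetical classes: same frequency, but the second is twice as wide
classes = [(0, 10, 20), (10, 30, 20)]
n = 40
densities = frequency_densities(classes, n)
areas = [d * (upper - lower) for d, (lower, upper, _) in zip(densities, classes)]
print(densities, areas)
```

The wider class gets a bar half as tall, yet both areas come out equal to the shared relative frequency of 0.5.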
Worked Example 12.3.A
A sample of $n = 40$ measurements is grouped as follows. All classes have the same width.
| Interval | Frequency |
|---|---|
| | 4 |
| | 12 |
| | 16 |
| | 8 |
(a) Complete the relative frequency and frequency density columns.
(b) Which bar is tallest in the histogram?
(c) How many values are at least 30?
Solution.
(a)
| Interval | Frequency | Relative frequency | Freq. density |
|---|---|---|---|
| | 4 | 0.10 | |
| | 12 | 0.30 | |
| | 16 | 0.40 | |
| | 8 | 0.20 | |
| Total | 40 | 1.00 | n/a |
(b) All class widths are equal, so the class with the largest frequency (16) has the highest frequency density, and its bar is the tallest.
(c) Add the frequencies of the two classes that cover values of at least 30: $16 + 8 = 24$.
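The parts of this example that depend only on the frequency column can be checked with a short sketch. The frequencies come from the table; the class width is not used, because with equal widths the tallest bar is simply the class with the largest frequency.

```python
frequencies = [4, 12, 16, 8]  # frequency column of Worked Example 12.3.A
n = sum(frequencies)

# Relative frequency of each class: frequency divided by total
relative = [f / n for f in frequencies]

# Equal widths: largest frequency gives the largest frequency density
tallest_index = frequencies.index(max(frequencies))

print(n, relative, tallest_index)
```

The total is 40, the relative frequencies are 0.1, 0.3, 0.4, and 0.2, and the third class (index 2) has the tallest bar.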
Try it! 自测练习
Q1. Find the mean, median, and mode of: .
Q2. The data set has mean . Find .
Q3. Calculate the variance of .
Q4. A frequency density histogram has a bar over with height . The total sample size is . How many values fall in ?
Q5. Two classes both have mean score 75. Class A has and Class B has . Which class is more consistent, and why?
Answers & explanations
Q1. Sort: .
- Mean: .
- Median: $n = 5$ (odd), position 3 → 7.
- Mode: 7 (appears twice).
Q2. Total sum . Then .
Q3. Mean: . Deviations: . Squared: .
Q4. Relative frequency $=$ bar height $\times$ class width. Count $=$ relative frequency $\times$ total sample size.
Q5. Class A is more consistent. Both classes average 75, but Class A's variance is far smaller, meaning its scores cluster tightly around the mean; Class B's scores are much more spread out.
📌 Chapter summary
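| Topic | Key formula / concept | Common trap |
|---|---|---|
| Mean 平均数 | $\bar{x} = \frac{1}{n}\sum x_i$ | Pulled by outliers; prefer median for skewed data |
| Median 中位数 | Middle of sorted list; average two middles when $n$ is even | Forgetting to sort first; wrong middle position for even $n$ |
| Mode 众数 | Most frequent value | May not exist, or may be non-unique |
| Variance 方差 | $s^2 = \frac{1}{n}\sum (x_i - \bar{x})^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2$ | Using $n - 1$ (sample) instead of $n$ (population) |
| Standard deviation 标准差 | $s = \sqrt{s^2}$; same units as data | $s = 0$ iff all values equal |
| Relative frequency 频率 | $\frac{\text{frequency}}{n}$; all classes sum to 1 | Confusing frequency count with relative frequency |
| Frequency density 频率/组距 | $\frac{\text{relative frequency}}{\text{class width}}$; bar area = relative frequency | Reading histogram bar height as frequency or probability directly |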
Connection to earlier units → Variance's shortcut form $s^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2$ follows from expanding the square $(x_i - \bar{x})^2$, the same kind of algebraic identity that appears throughout Unit 1 and Unit 3. Recognising it lets you skip the deviation table entirely on the exam.