From Real Exams Quiz
Primary 4 Mathematics Fractions Quiz
Free, exam-derived Primary 4 Mathematics Fractions quiz with questions and answers for Singapore students. Model: NVIDIA Nemotron 3 Ultra 550B A55B.
These static practice materials are generated from the site's syllabus and paper-generation workflow, with source and model context shown so students and parents can evaluate the material before use.
Questions
Stage 3 Quiz: Advanced Topics
1. What is the primary purpose of a Generative Adversarial Network (GAN)? A) To classify images into predefined categories B) To generate new data samples that resemble training data C) To reduce the dimensionality of data D) To optimize hyperparameters automatically
2. In Transfer Learning, what does "fine-tuning" typically involve? A) Training a model from scratch on a new dataset B) Freezing all layers of a pre-trained model C) Unfreezing some top layers of a pre-trained model and training on new data D) Using only the input layer of a pre-trained model
3. What is the "Vanishing Gradient Problem" in deep neural networks? A) Gradients become too large, causing unstable training B) Gradients become extremely small, preventing weight updates in early layers C) The loss function becomes non-differentiable D) The learning rate decays to zero too quickly
4. Which technique is commonly used to prevent overfitting in deep learning models? A) Increasing model complexity indefinitely B) Dropout C) Removing all regularization D) Training for infinite epochs
5. What does the Attention Mechanism in Transformers allow the model to do? A) Process sequences strictly sequentially B) Focus on relevant parts of the input sequence when producing output C) Reduce the vocabulary size D) Eliminate the need for positional encoding
6. In a GAN architecture, what is the role of the Discriminator? A) To generate fake data samples B) To classify input data as real or fake C) To optimize the generator's loss function D) To reduce the dimensionality of the latent space
7. When using a pre-trained model for Transfer Learning on a small dataset, what is the recommended initial approach? A) Train all layers with a very high learning rate B) Freeze the convolutional base and train only the classifier head C) Remove the pre-trained weights and re-initialize randomly D) Use the model for inference only without any training
8. Which activation function helps mitigate the Vanishing Gradient Problem in deep networks? A) Sigmoid B) Tanh C) ReLU (Rectified Linear Unit) D) Softmax
9. What is the primary purpose of Batch Normalization in deep neural networks? A) To increase the number of parameters B) To normalize the inputs of each layer to stabilize and accelerate training C) To replace the need for activation functions D) To perform data augmentation on the fly
10. In the context of regularization, what does L2 Regularization (Weight Decay) do? A) Adds a penalty proportional to the absolute value of weights B) Adds a penalty proportional to the square of the weights C) Randomly sets weights to zero during training D) Clips gradients to a maximum value
11. What is the key innovation of the Transformer architecture compared to RNNs/LSTMs? A) It uses recurrence to process sequences B) It relies entirely on self-attention mechanisms, enabling parallelization C) It processes tokens one at a time sequentially D) It uses convolutional layers for sequence modeling
12. In Multi-Head Attention, why are multiple attention heads used? A) To increase computational complexity unnecessarily B) To allow the model to attend to information from different representation subspaces simultaneously C) To reduce the dimensionality of the model D) To replace the feed-forward network
13. What is "Mode Collapse" in the context of GAN training? A) The discriminator becomes too strong and rejects all samples B) The generator produces a limited variety of samples, ignoring parts of the data distribution C) The loss function converges to zero immediately D) The model architecture collapses into a single layer
14. Which technique is specifically designed to address the issue of catastrophic forgetting in continual learning? A) Dropout B) Elastic Weight Consolidation (EWC) C) Batch Normalization D) Data Augmentation
15. What is the purpose of Positional Encoding in the Transformer model? A) To encode the semantic meaning of words B) To inject information about the relative or absolute position of tokens in the sequence C) To reduce the sequence length D) To normalize the attention scores
16. In the Diffusion Model framework, what does the "forward process" do? A) Generates data from pure noise B) Gradually adds Gaussian noise to data until it becomes pure noise C) Denoises the data step by step D) Calculates the loss function for the discriminator
17. What is the main advantage of using Layer Normalization over Batch Normalization in Transformer models? A) It normalizes across the batch dimension B) It is independent of batch size and works well for variable sequence lengths C) It requires fewer learnable parameters D) It eliminates the need for residual connections
18. In Reinforcement Learning, what does the "Exploration vs. Exploitation" trade-off refer to? A) Choosing between training and inference modes B) Balancing trying new actions to discover rewards vs. using known high-reward actions C) Deciding between supervised and unsupervised learning D) Selecting the optimal batch size
19. What is the function of the "Feed-Forward Network" (FFN) in each Transformer encoder/decoder layer? A) To compute attention scores B) To apply position-wise non-linear transformations to each token representation independently C) To generate positional encodings D) To normalize the output of the attention layer
20. Which evaluation metric is most appropriate for assessing the quality of generated images from a GAN? A) Accuracy B) FID (Fréchet Inception Distance) C) Perplexity D) BLEU Score
Answers
Stage 3 Quiz Answers
1. B - GANs consist of a generator and discriminator competing via adversarial training. The generator learns to produce realistic data samples (e.g., images) from random noise, while the discriminator learns to distinguish real from fake. The primary purpose is generation of new, plausible data.
2. C - Fine-tuning involves taking a model pre-trained on a large dataset (such as ImageNet), freezing the early layers (which learn generic features like edges/textures), and unfreezing and training the later layers (task-specific features) on the new target dataset with a low learning rate. This adapts the model without destroying pre-learned features.
3. B - During backpropagation, gradients are multiplied by derivatives of activation functions (sigmoid derivatives are at most 0.25; tanh derivatives are at most 1 and near 0 when the unit saturates) and by weights across many layers. In deep networks, this product can shrink exponentially, becoming vanishingly small for early layers. Weights in early layers barely update, so those layers effectively stop learning. Solutions: ReLU, Residual Connections, Batch Norm.
4. B - Dropout randomly sets a fraction of neuron outputs to zero during each training step. This prevents neurons from co-adapting too heavily (relying on specific other neurons), forcing the network to learn redundant, robust representations. It acts as an ensemble of many thinned networks.
5. B - Attention (specifically Scaled Dot-Product Attention) computes Attention(Q, K, V) = softmax(QK^T/√d_k)V. It allows each output position to compute a weighted sum of all input values (V), where weights depend on compatibility between Query (Q) and Keys (K). This captures long-range dependencies directly, unlike RNNs.
6. B - The Discriminator (D) is a binary classifier. Its input is either real data (from the training set) or fake data (from Generator G). It outputs the probability that its input is real. It is trained to maximize log(D(x)) + log(1-D(G(z))). The Generator tries to minimize log(1-D(G(z))) (or, in the non-saturating variant, to maximize log(D(G(z)))).
7. B - With a small dataset, training a deep network from scratch overfits severely. Freezing the convolutional base (feature extractor) leverages pre-learned generic features. Only the randomly initialized classifier head (dense layers) is trained initially. This is "Feature Extraction" mode. Fine-tuning (unfreezing some base layers) may follow.
8. C - ReLU (f(x)=max(0,x)) has a derivative of 1 for x>0. Unlike sigmoid (derivative at most 0.25) or tanh (derivative at most 1, and close to 0 once the unit saturates), gradients flow unchanged through active ReLU paths during backpropagation, mitigating vanishing gradients. Variants (Leaky ReLU, GELU) address the "dying ReLU" issue (zero gradient for x<0).
9. B - Batch Norm normalizes layer inputs (activations) per mini-batch, feature by feature: μ_B = mean(x), σ_B² = var(x), x̂ = (x-μ_B)/√(σ_B²+ε), y = γx̂ + β. It was originally motivated as reducing Internal Covariate Shift (distribution change of layer inputs). In practice it allows higher learning rates, has a mild regularizing effect, and makes training deeper networks feasible.
10. B - L2 Regularization adds λ/2 * Σ w² to the loss function. The gradient update becomes w ← w - η(∂L/∂w + λw) = w(1-ηλ) - η∂L/∂w. This shrinks weights proportionally (weight decay), preferring smaller weights, smoothing the model, and reducing overfitting. L1 (Lasso) uses |w| for sparsity.
11. B - RNNs/LSTMs process tokens sequentially (t=1..T), preventing parallelization and struggling with long-range dependencies due to sequential path length O(T). Transformers use Self-Attention: every token attends to every other token in parallel (O(1) sequential ops), capturing global context directly. Positional encoding adds order info.
12. B - Multi-Head Attention projects Q, K, V into h different subspaces (heads) via learned linear projections. Each head computes attention independently: head_i = Attention(QW_i^Q, KW_i^K, VW_i^V). Outputs are concatenated and projected. This allows attending to different relationships (e.g., syntactic, semantic, positional) simultaneously.
13. B - Mode Collapse: Generator G finds a single output (or small set) that fools Discriminator D, and converges to producing only that mode. D cannot provide useful gradients to push G toward other modes. The generated distribution lacks diversity compared to the true data distribution. Solutions: Minibatch Discrimination, Unrolled GAN, WGAN.
14. B - Catastrophic Forgetting: Training on Task B erases knowledge of Task A. EWC adds a quadratic penalty to the loss: L_total = L_B + Σ (λ/2) F_i (θ_i - θ_i^A)², where F_i is the Fisher Information (importance) of parameter θ_i for Task A. Important parameters are constrained near their Task A values.
15. B - Self-Attention is permutation equivariant: it treats input as a set, ignoring order. Positional Encodings (PE) are added to token embeddings to inject sequence order. Original Transformer uses fixed sinusoidal PE: PE(pos, 2i) = sin(pos/10000^(2i/d)), PE(pos, 2i+1) = cos(...). Learned PE is also used.
16. B - Diffusion Models (e.g., DDPM) define a Markov chain Forward Process: q(x_t | x_{t-1}) = N(x_t; √(1-β_t) x_{t-1}, β_t I). Starting from data x_0, Gaussian noise is added over T steps until x_T ~ N(0, I). The Reverse Process (learned) denoises: p_θ(x_{t-1} | x_t).
17. B - Batch Norm computes statistics over the batch dimension (N). Those statistics become noisy for small batches and are awkward for variable-length sequences (padded positions distort them), which is the usual situation in RNNs and Transformers. Layer Norm computes statistics over the feature dimension (d_model) for each sample independently. It is batch-size invariant and standard in Transformers.
18. B - An RL agent must balance Exploring (trying actions that currently look suboptimal, to discover potentially better rewards/state transitions) against Exploiting (choosing the currently estimated best action to maximize immediate reward). Strategies: ε-greedy (random action with prob ε), UCB (optimism in the face of uncertainty), Thompson Sampling (Bayesian).
19. B - In each Transformer layer (Encoder/Decoder), after Attention + Add&Norm, there is a Position-wise FFN: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 (ReLU/GELU activation). Applied identically but independently to each token position (position-wise). It processes the information gathered by attention, adding non-linearity and capacity.
20. B - FID compares statistics of generated images vs. real images in Inception-v3 feature space: FID = ||μ_r - μ_g||² + Tr(Σ_r + Σ_g - 2(Σ_r Σ_g)^(1/2)). Lower FID = better quality & diversity. It correlates well with human perception. Inception Score (IS) is older; Precision/Recall for distributions give more detail. Accuracy/BLEU/Perplexity are for classification/translation/LM.
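Answers 3 and 8 describe how a product of activation derivatives shrinks with depth. The sketch below illustrates this under simplifying assumptions that are not part of the quiz: every layer has weight 1 and sees the same pre-activation, so the backpropagated factor is just one derivative raised to the power of the depth. The function names are illustrative.

```python
import math

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid at x; its maximum is 0.25 at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU at x (taken as 0 for x <= 0)."""
    return 1.0 if x > 0 else 0.0

def chained_gradient(grad_fn, pre_activation, depth):
    """Product of `depth` identical activation derivatives (weights assumed to be 1)."""
    return grad_fn(pre_activation) ** depth

# Best case for sigmoid (x = 0) versus an active ReLU unit (x > 0), 20 layers deep.
sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, 20)
relu_chain = chained_gradient(relu_grad, 1.0, 20)
print(sigmoid_chain, relu_chain)
```

Even in sigmoid's best case the factor is 0.25 to the power 20, while the active ReLU path keeps a factor of exactly 1.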
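Answer 5 gives the formula Attention(Q, K, V) = softmax(QK^T/√d_k)V. A minimal pure-Python sketch of that formula follows, assuming Q, K and V are lists of row vectors; the tiny example matrices are made up for illustration and there is no masking or batching.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Compatibility of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
print(result)
```

The query matches the first key more closely, so the output leans toward the first value vector while still mixing in the second.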
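Answer 10 states that the L2-penalized update w ← w - η(∂L/∂w + λw) can be rewritten as w(1-ηλ) - η∂L/∂w. The sketch below checks that identity numerically for a single scalar weight under plain SGD; the numeric values are arbitrary illustrations.

```python
def sgd_step_l2(w, grad, lr, lam):
    """Plain SGD step on the loss plus (lam / 2) * w**2."""
    return w - lr * (grad + lam * w)

def sgd_step_decay(w, grad, lr, lam):
    """The same step written as multiplicative weight decay."""
    return w * (1.0 - lr * lam) - lr * grad

w, grad, lr, lam = 2.0, 0.5, 0.1, 0.01
penalized = sgd_step_l2(w, grad, lr, lam)
decayed = sgd_step_decay(w, grad, lr, lam)
print(penalized, decayed)
```

Both forms give the same new weight, which is why the two names are used interchangeably for plain SGD.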
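Answer 14 gives the EWC penalty Σ (λ/2) F_i (θ_i - θ_i^A)². A small sketch of that term is shown below, assuming the parameters and Fisher values are plain lists of floats; the example numbers are invented to show that moving an important parameter costs more than moving an unimportant one by the same amount.

```python
def ewc_penalty(theta, theta_a, fisher, lam):
    """Sum over parameters of (lam / 2) * F_i * (theta_i - theta_a_i) ** 2."""
    return sum(0.5 * lam * f * (t - ta) ** 2 for t, ta, f in zip(theta, theta_a, fisher))

theta_a = [1.0, -2.0]   # parameter values after Task A (illustrative)
fisher = [10.0, 0.1]    # first parameter matters far more for Task A
lam = 1.0

moved_important = ewc_penalty([1.5, -2.0], theta_a, fisher, lam)
moved_unimportant = ewc_penalty([1.0, -1.5], theta_a, fisher, lam)
print(moved_important, moved_unimportant)
```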
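Answer 15 mentions the fixed sinusoidal positional encoding. The sketch below builds that table in pure Python, assuming the standard form in which the sine and cosine of each dimension pair share the argument pos/10000^(2i/d); the function name and table sizes are illustrative.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal table: even columns hold sin, odd columns hold cos of the same angle."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        for even in range(0, d_model, 2):
            # `even` plays the role of 2i in the formula, so the exponent is 2i / d_model.
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)
    return pe

pe = positional_encoding(4, 4)
print(pe[0])
print(pe[1])
```

Each row would be added to the token embedding at that position before the first attention layer.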
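Answer 16 defines the forward step q(x_t | x_{t-1}) = N(x_t; √(1-β_t) x_{t-1}, β_t I). The sketch below applies that step repeatedly to a toy vector. The linear β schedule from 1e-4 to 0.02 over 1000 steps is an assumption borrowed from common DDPM setups, not something the quiz specifies, and the helper names are illustrative.

```python
import math
import random

def forward_step(x_prev, beta, rng):
    """Sample x_t from N(sqrt(1 - beta) * x_prev, beta * I), element by element."""
    scale = math.sqrt(1.0 - beta)
    noise = math.sqrt(beta)
    return [scale * x + noise * rng.gauss(0.0, 1.0) for x in x_prev]

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule with T steps."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

rng = random.Random(0)
betas = linear_betas(1000)
x = [5.0, 5.0, 5.0, 5.0]   # toy "data" vector x_0
for beta in betas:
    x = forward_step(x, beta, rng)

# Fraction of the original signal still present in the mean of x_T.
signal_scale = math.prod(math.sqrt(1.0 - b) for b in betas)
print(signal_scale, x)
```

After all steps the surviving signal scale is well below 1 percent, so x_T is close to a sample from N(0, I).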
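Answers 9 and 17 contrast the axis over which Batch Norm and Layer Norm compute their statistics. The sketch below shows only that difference, assuming a batch stored as a list of rows (one row per sample) and leaving out the learnable γ and β (equivalent to γ = 1, β = 0); running statistics for inference are also omitted.

```python
import math

def normalize(values, eps=1e-5):
    """Subtract the mean and divide by sqrt(variance + eps)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(batch):
    """Normalize each feature (column) across the samples in the batch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalize each sample (row) across its own features."""
    return [normalize(row) for row in batch]

batch = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
bn = batch_norm(batch)
ln = layer_norm(batch)

# With a batch of one sample, batch statistics carry no information.
bn_single = batch_norm([batch[0]])
ln_single = layer_norm([batch[0]])
print(bn_single, ln_single)
```

The Layer Norm output for a sample is the same whether it is processed alone or in a batch, whereas Batch Norm on a single sample collapses every feature to zero.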
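Answer 18 lists ε-greedy as one exploration strategy. A minimal sketch of the action-selection rule follows, assuming the agent already holds a list of estimated action values; the Q-values and function name are illustrative.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]

greedy_action = epsilon_greedy(q_values, 0.0, rng)   # never explores
random_action = epsilon_greedy(q_values, 1.0, rng)   # always explores
print(greedy_action, random_action)
```

Setting ε to 0 reduces the rule to pure exploitation, and ε of 1 to uniform random exploration; practical agents often start high and decay ε over time.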