Understanding NormalizeData() function
In this short article, using a tiny, 5x5 matrix example dataset, I will explain the NormalizeData()
function and the three Normalization.method()
choices.
The cell expression data is often contained in the form of a
- Rows: Genes
- Columns: Cells/Samples/Replicates
- Entries: Number of RNA transcripts detected
The expression data is contained
Cell1 | Cell2 | Cell3 | Cell4 | Cell5 | |
---|---|---|---|---|---|
GeneA | 2 | 10 | 9 | 8 | 9 |
GeneB | 2 | 4 | 10 | 8 | 6 |
GeneC | 9 | 3 | 4 | 8 | 9 |
GeneD | 1 | 5 | 2 | 2 | 8 |
GeneE | 5 | 8 | 10 | 7 | 2 |
Total | 19 | 30 | 35 | 33 | 34 |
As columns in count matrix represent UMI counts for each cells, the sums of UMI counts in a column represent total cellular expression.
As mentioned above, there are three Normalization.method()
choices under NormalizeData()
function in Seurat.
By default, the NormalizeData()
function of Seurat pipeline uses 10,000 as a scale.factor
. The UMI counts for each cell is multiplied by the scale.factor
and divided by the total UMI counts for the cell.
Cell1 | Cell2 | Cell3 | Cell4 | Cell5 | |
---|---|---|---|---|---|
GeneA | (2 * 10000) / 19 | (10 * 10000) / 30 | (9 * 10000) / 35 | (8 * 10000) / 33 | (9 * 10000) / 34 |
GeneB | (2 * 10000) / 19 | (4 * 10000) / 30 | (10 * 10000) / 35 | (8 * 10000) / 33 | (6 * 10000) / 34 |
GeneC | (9 * 10000) / 19 | (3 * 10000) / 30 | (4 * 10000) / 35 | (8 * 10000) / 33 | (9 * 10000) / 34 |
GeneD | (1 * 10000) / 19 | (5 * 10000) / 30 | (2 * 10000) / 35 | (2 * 10000) / 33 | (8 * 10000) / 34 |
GeneE | (5 * 10000) / 19 | (8 * 10000) / 30 | (10 * 10000) / 35 | (7 * 10000) / 33 | (2 * 10000) / 34 |
Cell1 | Cell2 | Cell3 | Cell4 | Cell5 | |
---|---|---|---|---|---|
GeneA | 1052.63 | 3333.33 | 2571.42 | 2424.24 | 2647.05 |
GeneB | 1052.63 | 1333.33 | 2857.14 | 2424.24 | 1764.70 |
GeneC | 4736.84 | 1000 | 1142.85 | 2424.24 | 2647.05 |
GeneD | 526.31 | 1666.66 | 571.42 | 606.06 | 2352.94 |
GeneE | 2631.57 | 2666.67 | 2857.14 | 2121.21 | 588.23 |
In this step, the scaled UMI counts are natural log transformed using log1p, i.e. ln(x + 1).
Cell1 | Cell2 | Cell3 | Cell4 | Cell5 | |
---|---|---|---|---|---|
GeneA | ln(1052.63 + 1) | ln(3333.33 + 1) | ln(2571.42 + 1) | ln(2424.24 + 1) | ln(2647.05 + 1) |
GeneB | ln(1052.63 + 1 | ln(1333.33 + 1) | ln(2857.14 + 1) | ln(2424.24 + 1) | ln(1764.70 + 1) |
GeneC | ln(4736.84 + 1) | ln(1000 + 1) | ln(1142.85 + 1) | ln(2424.24 + 1) | ln(2647.05 + 1) |
GeneD | ln(526.31 + 1) | ln(1666.66 + 1) | ln(571.42 + 1) | ln(606.06 + 1) | ln(2352.94 + 1) |
GeneE | ln(2631.57 + 1) | ln(2666.67 + 1) | ln(2857.14 + 1) | ln(2121.21 + 1) | ln(588.23 + 1) |
Why +1?
If any empty cell is detected, i.e. zero UMI count, the log-normalization, ln(0), will result in a ‘mathematical error’. In contrast, ln(0 + 1) will give a value of 0.
In short,
- ln(0) = Numerical error
- ln(0 + 1) = 0
Cell1 | Cell2 | Cell3 | Cell4 | Cell5 | |
---|---|---|---|---|---|
GeneA | 6.959998122 | 8.112028038 | 7.852605701 | 7.793686767 | 7.881582131 |
GeneB | 6.959998122 | 7.19618707 | 7.957927342 | 7.793686767 | 7.476305823 |
GeneC | 8.463337059 | 6.908754779 | 7.042161289 | 7.793686767 | 7.881582131 |
GeneD | 6.26779959 | 7.419180723 | 6.349887962 | 6.408628631 | 7.763846299 |
GeneE | 7.875719233 | 7.888959462 | 7.957927342 | 7.660214277 | 6.378825585 |
Total | 36.52685213 | 37.52511007 | 37.16050964 | 37.44990321 | 37.38214197 |
(This section is under development)
In contrast to
By default, the NormalizeData()
function of Seurat pipeline uses 10,000 as a scale.factor
. The UMI counts for each cell is multiplied by the scale.factor
and divided by the total UMI counts for the cell.
Cell1 | Cell2 | Cell3 | Cell4 | Cell5 | |
---|---|---|---|---|---|
GeneA | (2 * 10000) / 19 | (10 * 10000) / 30 | (9 * 10000) / 35 | (8 * 10000) / 33 | (9 * 10000) / 34 |
GeneB | (2 * 10000) / 19 | (4 * 10000) / 30 | (10 * 10000) / 35 | (8 * 10000) / 33 | (6 * 10000) / 34 |
GeneC | (9 * 10000) / 19 | (3 * 10000) / 30 | (4 * 10000) / 35 | (8 * 10000) / 33 | (9 * 10000) / 34 |
GeneD | (1 * 10000) / 19 | (5 * 10000) / 30 | (2 * 10000) / 35 | (2 * 10000) / 33 | (8 * 10000) / 34 |
GeneE | (5 * 10000) / 19 | (8 * 10000) / 30 | (10 * 10000) / 35 | (7 * 10000) / 33 | (2 * 10000) / 34 |
Cell1 | Cell2 | Cell3 | Cell4 | Cell5 | |
---|---|---|---|---|---|
GeneA | 1052.63 | 3333.33 | 2571.42 | 2424.24 | 2647.05 |
GeneB | 1052.63 | 1333.33 | 2857.14 | 2424.24 | 1764.70 |
GeneC | 4736.84 | 1000 | 1142.85 | 2424.24 | 2647.05 |
GeneD | 526.31 | 1666.66 | 571.42 | 606.06 | 2352.94 |
GeneE | 2631.57 | 2666.67 | 2857.14 | 2121.21 | 588.23 |
Total | 10000 | 10000 | 10000 | 10000 | 10000 |