You can set several options to adapt the behavior of sdcLog.
First, we create a tiny data.frame to demonstrate the effects of the options:
library(sdcLog)
df <- data.frame(id = LETTERS[1:3], v1 = 1L:3L, v2 = c(1L, 2L, 4L))
df
#   id v1 v2
# 1  A  1  1
# 2  B  2  2
# 3  C  3  4
By default, sdcLog expects at least five different entities behind each calculated number. The functions in sdcLog derive this number from getOption("sdc.n_ids", default = 5). That is, if the option sdc.n_ids is not set, it defaults to 5.
Consider the following example:
sdc_descriptives(data = df, id_var = "id", val_var = "v1")
# Warning: DISCLOSURE PROBLEM: Not enough distinct entities.
# ────────────────────────────────────────────────── SDC results (descriptives) ──
# OPTIONS: sdc.n_ids: 5 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85
# SETTINGS: id_var: id | val_var: v1 | zero_as_NA: FALSE
# ✖ Not enough distinct entities:
# distinct_ids
# 1: 3
# ────────────────────────────────────────────────────────────────────────────────
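The idea behind this check can be reproduced by hand in base R. The following is only a minimal sketch of the counting logic, not sdcLog's internal implementation; the variable names and the hard-coded threshold are assumptions made for illustration.

```r
# Same toy data as above
df <- data.frame(id = LETTERS[1:3], v1 = 1L:3L, v2 = c(1L, 2L, 4L))

# Count the distinct entities that contribute a non-missing value to v1
distinct_ids <- length(unique(df$id[!is.na(df$v1)]))

# Illustrative threshold: 5 is the default quoted in the text
threshold <- 5
enough_ids <- distinct_ids >= threshold

distinct_ids  # 3
enough_ids    # FALSE, hence the warning
```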
This threshold can be adapted to the policy of your research data center by setting the option sdc.n_ids to the desired value. For example, if your policy allows results to be released if there are at least three different entities behind each number, set
options(sdc.n_ids = 3)
Now, getOption("sdc.n_ids", default = 5) evaluates to 3 and warnings are thrown only if there are fewer than three entities behind each result. Note that this is reflected in the OPTIONS line at the top of the output from every function of sdcLog:
sdc_descriptives(data = df, id_var = "id", val_var = "v1")
# ────────────────────────────────────────────────── SDC results (descriptives) ──
# OPTIONS: sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85
# SETTINGS: id_var: id | val_var: v1 | zero_as_NA: FALSE
# ✔ Output complies to RDC rules.
# ────────────────────────────────────────────────────────────────────────────────
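The fallback behaviour of getOption() is plain base R and can be checked without calling any sdcLog function. The sketch below temporarily removes the option, reads it with and without a stored value, and then restores the previous state; the variable names are illustrative.

```r
# Remove the option (if present) and remember the previous state
old <- options(sdc.n_ids = NULL)

# Option unset: getOption() falls back to the supplied default
n_ids_unset <- getOption("sdc.n_ids", default = 5)

# Option set: the stored value takes precedence over the default
options(sdc.n_ids = 3)
n_ids_set <- getOption("sdc.n_ids", default = 5)

# Restore whatever was configured before this sketch ran
options(old)

n_ids_unset  # 5
n_ids_set    # 3
```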
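The default value for sdc.n_ids_dominance is 2. That is, the combined share of the two largest entities is compared against sdc.share_dominance (0.85 by default). In our example, this leads to a warning: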
sdc_descriptives(data = df, id_var = "id", val_var = "v2")
# Warning: DISCLOSURE PROBLEM: Dominant entities.
# ────────────────────────────────────────────────── SDC results (descriptives) ──
# OPTIONS: sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85
# SETTINGS: id_var: id | val_var: v2 | zero_as_NA: FALSE
# ✖ Dominant entities:
# value_share
# 1: 0.8571429
# ────────────────────────────────────────────────────────────────────────────────
If your policy instead requires only that the largest entity alone accounts for a share of less than 0.85, set
options(sdc.n_ids_dominance = 1)
Then, there is no problem in the example:
sdc_descriptives(data = df, id_var = "id", val_var = "v2")
# ────────────────────────────────────────────────── SDC results (descriptives) ──
# OPTIONS: sdc.n_ids: 3 | sdc.n_ids_dominance: 1 | sdc.share_dominance: 0.85
# SETTINGS: id_var: id | val_var: v2 | zero_as_NA: FALSE
# ✔ Output complies to RDC rules.
# ────────────────────────────────────────────────────────────────────────────────
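The two dominance results above can be reproduced with a few lines of base R arithmetic. The helper top_share() is hypothetical and not part of sdcLog; it is only a sketch of how the share of the n largest values in a total can be computed.

```r
# Values of v2 from the example data
v2 <- c(1L, 2L, 4L)

# Share of the total accounted for by the n largest values
# (hypothetical helper for illustration, not an sdcLog function)
top_share <- function(x, n) {
  x_sorted <- sort(x, decreasing = TRUE)
  sum(head(x_sorted, n)) / sum(x_sorted)
}

share_top2 <- top_share(v2, n = 2)  # (4 + 2) / 7, about 0.857: above 0.85
share_top1 <- top_share(v2, n = 1)  # 4 / 7, about 0.571: below 0.85
```

With two entities considered, the share exceeds 0.85 and the warning appears; with only the largest entity considered, it stays well below.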
The option sdc.info_level differs from the previous ones in that it does not affect the actual calculations. Instead, it determines the verbosity of the output of sdcLog functions. Possible values are 0, 1 (default), and 2. Before demonstrating the effects of sdc.info_level, we reset sdc.n_ids_dominance to its default value of 2:
options(sdc.n_ids_dominance = 2)
The example below shows the different levels of information printed to the console based on the different levels of sdc.info_level:
for (i in 0:2) {
options(sdc.info_level = i)
cat("\nsdc.info_level: ", getOption("sdc.info_level"), "\n")
print(sdc_descriptives(data = df, id_var = "id", val_var = "v1"))
}
#
# sdc.info_level: 0
# ────────────────────────────────────────────────── SDC results (descriptives) ──
# OPTIONS: sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85
# SETTINGS: id_var: id | val_var: v1 | zero_as_NA: FALSE
# ────────────────────────────────────────────────────────────────────────────────
#
# sdc.info_level: 1
# ────────────────────────────────────────────────── SDC results (descriptives) ──
# OPTIONS: sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85
# SETTINGS: id_var: id | val_var: v1 | zero_as_NA: FALSE
# ✔ Output complies to RDC rules.
# ────────────────────────────────────────────────────────────────────────────────
#
# sdc.info_level: 2
# ────────────────────────────────────────────────── SDC results (descriptives) ──
# OPTIONS: sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85
# SETTINGS: id_var: id | val_var: v1 | zero_as_NA: FALSE
# ✔ No problem with number of distinct entities (3).
# ✔ No problem with dominance (0.83).
# ✔ Output complies to RDC rules.
# ────────────────────────────────────────────────────────────────────────────────
At level 0, only options and settings are printed. Level 1 also prints a short message about the overall outcome of the checks. Level 2 additionally prints the results of the separate checks on distinct entities and dominance.
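Note that the loop above leaves sdc.info_level at 2 once it finishes. If the more verbose output is only wanted temporarily, one possible pattern is to keep the value returned by options() and restore it afterwards. This is a base R sketch; the variable names are illustrative.

```r
# Remember the current setting (NULL if the option is unset)
before <- getOption("sdc.info_level")

# options() returns the previous values, so they can be restored later
old <- options(sdc.info_level = 2)
level_inside <- getOption("sdc.info_level")

# ... call sdc_* functions here to get the detailed output ...

# Put back the earlier setting (or unset the option if it was unset)
options(old)
after <- getOption("sdc.info_level")
```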
Usually, the ID variable does not change during the course of your analysis. Therefore, it is convenient to set
options(sdc.id_var = "id")
Then you do not have to specify id_var every time you use one of the sdc_* functions:
sdc_descriptives(data = df, val_var = "v1")
# ────────────────────────────────────────────────── SDC results (descriptives) ──
# OPTIONS: sdc.n_ids: 3 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85
# SETTINGS: id_var: id | val_var: v1 | zero_as_NA: FALSE
# ✔ Output complies to RDC rules.
# ────────────────────────────────────────────────────────────────────────────────
Please note that these options affect all functions of sdcLog, not just sdc_descriptives().
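Because the options apply to every sdcLog function, it can be convenient to set them once at the top of an analysis script. The sketch below does this in a single options() call; the values are illustrative only and should be replaced by the ones your research data center prescribes.

```r
# Set all sdcLog-related options in one place (illustrative values)
old <- options(
  sdc.n_ids = 3,
  sdc.n_ids_dominance = 2,
  sdc.share_dominance = 0.85,
  sdc.info_level = 1,
  sdc.id_var = "id"
)

# List the sdc.* options that are currently set
all_opts <- options()
sdc_opts <- all_opts[startsWith(names(all_opts), "sdc.")]
names(sdc_opts)
```

Calling options(old) afterwards would restore the settings that were active before.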