Title: | Tools for Statistical Disclosure Control in Research Data Centers |
---|---|
Description: | Tools for researchers to explicitly show that their results comply with the rules for statistical disclosure control imposed by research data centers. These tools help in checking descriptive statistics and models and in calculating extreme values that are not individual data. Also included is a simple function to create log files. The methods used here are described in the "Guidelines for the checking of output based on microdata research" by Bond, Brandt, and de Wolf (2015) <https://ec.europa.eu/eurostat/cros/system/files/dwb_standalone-document_output-checking-guidelines.pdf>. |
Authors: | Matthias Gomolka [aut, cre], Tim Becker [aut], Pantelis Karapanagiotis [ctb] |
Maintainer: | Matthias Gomolka <[email protected]> |
License: | GPL-3 |
Version: | 0.5.0 |
Built: | 2025-02-27 03:46:42 UTC |
Source: | https://github.com/matthiasgomolka/sdclog |
arguments
data |
data.frame from which the descriptive statistics are calculated. |
id_var |
character The name of the id variable. Defaults to getOption("sdc.id_var"). |
val_var |
character vector of value variables on which descriptive statistics are computed. |
by |
character vector of grouping variables. |
zero_as_NA |
logical If TRUE, zeros in 'val_var' are treated as NA. |
fill_id_var |
logical Only for very specific use cases. For example:
If Defaults to |
model |
The estimated model object. Can be a model type like lm, glm
and various others (anything which can be handled by |
min_obs |
integer The minimum number of observations used to calculate
the minimum and maximum. Defaults to |
max_obs |
integer The maximum number of observations used to calculate
the minimum and maximum. Defaults to nrow(data). |
These methods print SDC objects. Tables containing information are only printed when relevant.
## S3 method for class 'sdc_distinct_ids'
print(x, ...)
## S3 method for class 'sdc_dominance'
print(x, ...)
## S3 method for class 'sdc_options'
print(x, ...)
## S3 method for class 'sdc_settings'
print(x, ...)
## S3 method for class 'sdc_descriptives'
print(x, ...)
## S3 method for class 'sdc_model'
print(x, ...)
## S3 method for class 'sdc_min_max'
print(x, ...)
x |
The object to be printed |
... |
Ignored. |
Checks the number of distinct entities and the (n, k) dominance rule for your descriptive statistics.
That means that sdc_descriptives() checks if there are at least 5 distinct entities and if the largest 2 entities account for 85% or more of val_var. The parameters can be changed using options. For details see vignette("options", package = "sdcLog").
sdc_descriptives(
  data,
  id_var = getOption("sdc.id_var"),
  val_var = NULL,
  by = NULL,
  zero_as_NA = NULL,
  fill_id_var = FALSE
)
data |
data.frame from which the descriptive statistics are calculated. |
id_var |
character The name of the id variable. Defaults to getOption("sdc.id_var"). |
val_var |
character vector of value variables on which descriptive statistics are computed. |
by |
character vector of grouping variables. |
zero_as_NA |
logical If TRUE, zeros in 'val_var' are treated as NA. |
fill_id_var |
logical Only for very specific use cases. For example:
If Defaults to |
The general form of the \((n, k)\) dominance rule can be formulated as:
\[\sum_{i=1}^{n}x_i > \frac{k}{100} \sum_{i=1}^{N}x_i\]
where \(x_1 \ge x_2 \ge \cdots \ge x_{N}\). \(n\) denotes the number of largest contributions to be considered, \(x_n\) the \(n\)-th largest contribution, \(k\) the maximal percentage these \(n\) contributions may account for, and \(N\) the total number of observations.
If the statement above is true, the \((n, k)\) dominance rule is violated.
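The inequality above can be sketched in a few lines of base R. This is only an illustration, not the package's implementation: `violates_dominance()` is a hypothetical helper, the toy contributions are made up, and the input is assumed to hold one non-negative contribution per entity. The sketch uses the strict inequality exactly as written in the formula, with \(n = 2\) and \(k = 85\) taken from the defaults described above.

```r
# Hypothetical helper (not part of sdcLog): checks the (n, k) dominance rule
# for a vector of contributions, assumed to be one value per entity.
violates_dominance <- function(x, n = 2, k = 85) {
  x <- sort(x[!is.na(x)], decreasing = TRUE)  # x_1 >= x_2 >= ... >= x_N
  top_n <- sum(head(x, n))                    # sum of the n largest contributions
  top_n > k / 100 * sum(x)                    # TRUE means the rule is violated
}

# Toy data: the two largest contributions sum to 90 out of 100.
concentrated <- c(60, 30, 5, 3, 2)
violates_dominance(concentrated)  # 90 > 85, so TRUE

# Toy data: evenly spread contributions, the two largest sum to 40 out of 100.
balanced <- rep(20, 5)
violates_dominance(balanced)      # 40 > 85 is FALSE
```

In practice the check is run per group defined by `by`, which is why sdc_descriptives() reports results for each combination of the grouping variables.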
A list of class sdc_descriptives with detailed information about options, settings, and compliance with the criteria distinct entities and dominance.
sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_1"
)
sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_1",
  by = "sector"
)
sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_1",
  by = c("sector", "year")
)
sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_2",
  by = c("sector", "year")
)
sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_2",
  by = c("sector", "year"),
  zero_as_NA = FALSE
)
sdc_descriptives()
Utilized in the vignette.
data("sdc_descriptives_DT")
A data.table with 20 rows and 5 columns.
The data.table contains the following columns:
id | factor | random identifier |
sector | factor | economic sector |
year | integer | time variable |
val_1, val_2 | numeric | value variables |
This function creates Stata-like log files from R scripts. It can handle several files (passed as a character vector) at once.
sdc_log(r_script, destination, replace = FALSE, append = FALSE, local = FALSE)
r_script |
character Path of the R script to be run with logging. |
destination |
One of:
|
replace |
logical Indicates whether to replace an existing log file. |
append |
logical Indicates whether to append to an existing log file. |
local |
One of:
|
character vector holding the path(s) of the written log file(s).
Checks whether the calculation of extreme values complies with RDC rules. If so, the function returns averaged minimum and maximum values according to RDC rules.
sdc_min_max(
  data,
  id_var = getOption("sdc.id_var"),
  val_var,
  by = NULL,
  max_obs = nrow(data),
  fill_id_var = FALSE
)
data |
data.frame from which the descriptive statistics are calculated. |
id_var |
character The name of the id variable. Defaults to getOption("sdc.id_var"). |
val_var |
character vector of value variables on which descriptive statistics are computed. |
by |
character vector of grouping variables. |
max_obs |
integer The maximum number of observations used to calculate
the minimum and maximum. Defaults to nrow(data). |
fill_id_var |
logical Only for very specific use cases. For example:
If Defaults to |
A list of class sdc_min_max with detailed information about options, settings, and the calculated extreme values (if possible).
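To illustrate why an averaged extreme value is not an individual data point, the following base-R sketch averages the smallest and largest observations of a vector. It is an assumption-laden illustration and not the algorithm used by sdc_min_max(): `avg_extremes()` and its `n_obs` argument are hypothetical, the choice of 5 observations is arbitrary, and the sketch performs no check of distinct entities or dominance.

```r
# Hypothetical helper (not part of sdcLog): averages the n_obs smallest and
# the n_obs largest values so that no single observation is reported.
avg_extremes <- function(x, n_obs = 5) {
  x <- sort(x[!is.na(x)])            # ascending order, missing values dropped
  stopifnot(length(x) >= n_obs)      # not enough observations to average
  c(
    min = mean(head(x, n_obs)),      # average of the n_obs smallest values
    max = mean(tail(x, n_obs))       # average of the n_obs largest values
  )
}

# Toy data, made up for this example.
vals <- c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
avg_extremes(vals)  # min = 3, max = 30
```

A real check would additionally have to confirm that the observations entering each average come from enough distinct entities and are not dominated by a few of them, which is what the function reports in its result.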
sdc_min_max(sdc_min_max_DT, id_var = "id", val_var = "val_1")
sdc_min_max(sdc_min_max_DT, id_var = "id", val_var = "val_2")
sdc_min_max(sdc_min_max_DT, id_var = "id", val_var = "val_3", max_obs = 10)
sdc_min_max(sdc_min_max_DT, id_var = "id", val_var = "val_1", by = "year")
sdc_min_max(
  sdc_min_max_DT,
  id_var = "id",
  val_var = "val_1",
  by = c("sector", "year")
)
sdc_min_max()
Utilized in the vignette
data("sdc_min_max_DT")
A data.table with 20 rows and 6 columns.
The data.table contains the following columns:
id | factor | random identifier |
sector | factor | economic sector |
year | integer | time variable |
val_1 - val_3 | numeric | value variables |
Checks if your model complies with RDC rules. Checks the overall number of entities and the number of entities for each level of dummy variables.
sdc_model(data, model, id_var = getOption("sdc.id_var"), fill_id_var = FALSE)
data |
data.frame which was used to build the model. |
model |
The estimated model object. Can be a model type like lm, glm
and various others (anything which can be handled by |
id_var |
character The name of the id variable. Defaults to getOption("sdc.id_var"). |
fill_id_var |
logical Only for very specific use cases. For example:
If Defaults to |
A list of class sdc_model with detailed information about options, settings, and compliance with the distinct entities criterion.
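The distinct entities criterion for dummy variables can be pictured with a small base-R sketch: count the distinct ids behind each level and compare the counts with a threshold. This is a simplified illustration, not the code behind sdc_model(): `ids_per_level()` is a hypothetical helper, the toy data are made up, and the threshold of 5 is the default number of distinct entities mentioned for sdc_descriptives().

```r
# Hypothetical helper (not part of sdcLog): number of distinct ids observed
# for each level of a dummy variable.
ids_per_level <- function(id, dummy) {
  vapply(
    split(id, dummy),                 # one group of ids per dummy level
    function(i) length(unique(i)),    # distinct entities within the level
    integer(1)
  )
}

# Toy data: entity "A" appears twice but counts only once.
toy <- data.frame(
  id    = c("A", "A", "B", "C", "D", "E", "F"),
  dummy = c("x", "x", "x", "x", "y", "y", "y")
)

counts <- ids_per_level(toy$id, toy$dummy)
counts      # x = 3, y = 3
counts < 5  # TRUE for both levels: too few distinct entities in each
```

Counting ids rather than rows matters because repeated observations of the same entity do not make a cell any safer to publish.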
# Check simple models
model_1 <- lm(y ~ x_1 + x_2, data = sdc_model_DT)
sdc_model(data = sdc_model_DT, model = model_1, id_var = "id")
model_2 <- lm(y ~ x_1 + x_2 + x_3, data = sdc_model_DT)
sdc_model(data = sdc_model_DT, model = model_2, id_var = "id")
model_3 <- lm(y ~ x_1 + x_2 + dummy_3, data = sdc_model_DT)
sdc_model(data = sdc_model_DT, model = model_3, id_var = "id")
sdc_model()
Utilized in the vignette
data("sdc_model_DT")
A data.table with 80 rows and 9 columns.
The data.table contains the following columns:
id | factor | random identifier |
y, x_1 - x_4 | numeric | value variables |
dummy_1 - dummy_3 | factor | dummy variables |