Package 'sdcLog' reference manual

Title:	Tools for Statistical Disclosure Control in Research Data Centers
Description:	Tools for researchers to explicitly show that their results comply with the rules for statistical disclosure control imposed by research data centers. These tools help in checking descriptive statistics and models and in calculating extreme values that are not individual data. Also included is a simple function to create log files. The methods used here are described in the "Guidelines for the checking of output based on microdata research" by Bond, Brandt, and de Wolf (2015) <https://ec.europa.eu/eurostat/cros/system/files/dwb_standalone-document_output-checking-guidelines.pdf>.
Authors:	Matthias Gomolka [aut, cre], Tim Becker [aut], Pantelis Karapanagiotis [ctb]
Maintainer:	Matthias Gomolka
License:	GPL-3
Version:	0.5.0
Built:	2025-02-27 03:46:42 UTC
Source:	https://github.com/matthiasgomolka/sdclog

arguments

Description

arguments

Arguments

`data`	data.frame from which the descriptive statistics are calculated.
`id_var`	character The name of the id variable. Defaults to `getOption("sdc.id_var")` so that you can provide `options(sdc.id_var = "my_id_var")` at the top of your script.
`val_var`	character vector of value variables on which descriptive statistics are computed.
`by`	character vector of grouping variables.
`zero_as_NA`	logical If TRUE, zeros in 'val_var' are treated as NA.
`fill_id_var`	logical Only for very specific use cases, for example: (1) `id_var` contains `NA` values that stand for identifiers which actually exist but are unknown (or were deleted for privacy reasons); (2) `id_var` contains `NA` values because an observation features more than one confidential identifier and not all of these identifiers are present in each observation. Examples of such identifiers are the role of a broker in a security transaction or the role of a collateral giver in a credit relationship. If `TRUE`, `NA` values within `id_var` are filled internally with `<filled_[i]>`, assuming that all `NA` values of `id_var` can be treated as different small entities for statistical disclosure control purposes. Set it to `TRUE` only if this assumption is reasonable. Defaults to `FALSE`.
`model`	The estimated model object. Can be a model type like lm, glm and various others (anything which can be handled by `broom::augment()`).
`min_obs`	integer The minimum number of observations used to calculate the minimum and maximum. Defaults to `getOption("sdc.n_ids", 5L)`. This is not the number of distinct entities.
`max_obs`	integer The maximum number of observations used to calculate the minimum and maximum. Defaults to `nrow(data)`. This is not the number of distinct entities.
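
The effect of `fill_id_var = TRUE` can be pictured with a small base R sketch. The helper `fill_ids()` is hypothetical and not part of sdcLog, and the running index in the placeholder is an assumption based on the `<filled_[i]>` pattern named above; the package's internals may differ.

```r
# Hypothetical helper (not part of sdcLog): give every NA id its own placeholder.
fill_ids <- function(id) {
  id <- as.character(id)
  missing <- is.na(id)
  # Assumption: [i] is a running index over the missing ids.
  id[missing] <- sprintf("<filled_%d>", seq_len(sum(missing)))
  id
}

ids <- c("A", NA, "B", NA, "A")
filled <- fill_ids(ids)
print(filled)

# Distinct known entities before filling, and distinct entities after filling.
print(length(unique(ids[!is.na(ids)])))
print(length(unique(filled)))
```

Each filled value counts as a separate entity, which is why the option should only be switched on when treating every missing identifier as a distinct small entity is defensible.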

Print methods for SDC objects

Description

These methods print SDC objects. Tables containing information are only printed when relevant.

Usage

## S3 method for class 'sdc_distinct_ids'
print(x, ...)

## S3 method for class 'sdc_dominance'
print(x, ...)

## S3 method for class 'sdc_options'
print(x, ...)

## S3 method for class 'sdc_settings'
print(x, ...)

## S3 method for class 'sdc_descriptives'
print(x, ...)

## S3 method for class 'sdc_model'
print(x, ...)

## S3 method for class 'sdc_min_max'
print(x, ...)

Arguments

`x`	The object to be printed
`...`	Ignored.

Disclosure control for descriptive statistics

Description

Checks the number of distinct entities and the (n, k) dominance rule for your descriptive statistics.

By default, sdc_descriptives() checks whether each statistic is based on at least 5 distinct entities and whether the 2 largest entities account for 85% or more of val_var. These parameters can be changed using options. For details see vignette("options", package = "sdcLog").
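
As a minimal sketch of how option-based defaults work, the following base R snippet sets and reads the two options that this manual names explicitly, `sdc.id_var` and `sdc.n_ids`. The values `"id"` and `3L` are arbitrary illustrations; further option names are documented in the vignette and are not assumed here.

```r
# Set options at the top of a script; keep the old values so they can be restored.
old <- options(sdc.id_var = "id", sdc.n_ids = 3L)

# The documented argument defaults perform lookups of this kind.
id_var <- getOption("sdc.id_var")
n_ids <- getOption("sdc.n_ids", 5L)  # 5L is only the fallback when the option is unset

print(id_var)
print(n_ids)

# Restore the previous state.
options(old)
```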

Usage

sdc_descriptives(
  data,
  id_var = getOption("sdc.id_var"),
  val_var = NULL,
  by = NULL,
  zero_as_NA = NULL,
  fill_id_var = FALSE
)

Arguments

`data`	data.frame from which the descriptive statistics are calculated.
`id_var`	character The name of the id variable. Defaults to `getOption("sdc.id_var")` so that you can provide `options(sdc.id_var = "my_id_var")` at the top of your script.
`val_var`	character vector of value variables on which descriptive statistics are computed.
`by`	character vector of grouping variables.
`zero_as_NA`	logical If TRUE, zeros in 'val_var' are treated as NA.
`fill_id_var`	logical Only for very specific use cases, for example: (1) `id_var` contains `NA` values that stand for identifiers which actually exist but are unknown (or were deleted for privacy reasons); (2) `id_var` contains `NA` values because an observation features more than one confidential identifier and not all of these identifiers are present in each observation. Examples of such identifiers are the role of a broker in a security transaction or the role of a collateral giver in a credit relationship. If `TRUE`, `NA` values within `id_var` are filled internally with `<filled_[i]>`, assuming that all `NA` values of `id_var` can be treated as different small entities for statistical disclosure control purposes. Set it to `TRUE` only if this assumption is reasonable. Defaults to `FALSE`.

Details

The general form of the \((n, k)\) dominance rule can be formulated as:

\[\sum_{i=1}^{n}x_i > \frac{k}{100} \sum_{i=1}^{N}x_i\]

where \(x_1 \ge x_2 \ge \cdots \ge x_{N}\) are the contributions sorted in descending order. \(n\) denotes the number of largest contributions to be considered, \(x_i\) the \(i\)-th largest contribution, \(k\) the maximal percentage these \(n\) contributions may account for, and \(N\) the total number of observations.

If the statement above is true, the \((n, k)\) dominance rule is violated.
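
The inequality can be checked by hand with a few lines of base R. The helper `violates_dominance()` is hypothetical and not part of sdcLog; it applies the formula literally to a vector of contributions and uses \(n = 2\) and \(k = 85\), the defaults mentioned in the description, as its own defaults.

```r
# Hypothetical helper (not part of sdcLog): TRUE means the (n, k) rule is violated.
violates_dominance <- function(x, n = 2L, k = 85) {
  x <- sort(x[!is.na(x)], decreasing = TRUE)  # x_1 >= x_2 >= ... >= x_N
  top_n <- sum(head(x, n))                    # sum of the n largest contributions
  top_n > k / 100 * sum(x)                    # the inequality from the formula
}

concentrated <- c(60, 30, 5, 3, 2)     # the two largest hold 90 of 100
balanced <- c(25, 25, 20, 15, 15)      # the two largest hold 50 of 100

print(violates_dominance(concentrated))
print(violates_dominance(balanced))
```

The sketch treats every element of `x` as one contribution. How sdc_descriptives() aggregates values per entity before applying the rule is not shown here.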

Value

A list of class sdc_descriptives with detailed information about options, settings, and compliance with the criteria distinct entities and dominance.

Examples

sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_1"
)

sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_1",
  by = "sector"
)

sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_1",
  by = c("sector", "year")
)

sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_2",
  by = c("sector", "year")
)

sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_2",
  by = c("sector", "year"),
  zero_as_NA = FALSE
)

Example data for `sdc_descriptives()`

Description

Utilized in the vignette.

Usage

data("sdc_descriptives_DT")

Format

A data.table with 20 rows and 5 columns.

Details

The data.table contains the following columns:

`id`	factor, random identifier
`sector`	factor, economic sector
`year`	integer, time variable
`val_1`, `val_2`	numeric, value variables

Create Stata-like log files from R Scripts

Description

This function creates Stata-like log files from R Scripts. It can handle several files (in a character vector) at once.

Usage

sdc_log(r_script, destination, replace = FALSE, append = FALSE, local = FALSE)

Arguments

`r_script`	character Path of the R script to be run with logging.
`destination`	One of: (1) character, the path of the log file to be used; (2) a file connection to which the log should be written. A connection is especially useful when you have nested calls to `sdc_log()` and want to write everything into the same log file: create a single file connection, provide it to all calls to `sdc_log()`, and close it afterwards.
`replace`	logical Indicates whether to replace an existing log file.
`append`	logical Indicates whether to append to an existing log file.
`local`	One of: (1) logical, indicating whether to evaluate within the global environment (`FALSE`) or the calling environment (`TRUE`); (2) environment, a specific evaluation environment. Useful whenever `sdc_log()` is called from within a function, or for nested `sdc_log()` calls. By default (`FALSE`), evaluation occurs in the global environment. See also `source()`.

Value

character vector holding the path(s) of the written log file(s).
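
A minimal usage sketch, assuming sdcLog is installed: it writes a throwaway two-line script to a temporary file and logs it to a fresh temporary log file. The script contents and file locations are made up for illustration, and only the `r_script` and `destination` arguments from the usage above are used.

```r
# Create a tiny throwaway script to be logged.
script <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10", "summary(x)"), script)

# tempfile() only returns a name, so the log file does not exist yet.
log_file <- tempfile(fileext = ".log")

if (requireNamespace("sdcLog", quietly = TRUE)) {
  # Run the script with logging; the return value holds the log file path.
  written <- sdcLog::sdc_log(r_script = script, destination = log_file)
  print(written)
}
```

Because the log file is new, neither `replace` nor `append` needs to be set.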

Calculate RDC rule-compliant extreme values

Description

Checks whether the calculation of extreme values complies with RDC rules. If so, the function returns averaged minimum and maximum values according to RDC rules.

Usage

sdc_min_max(
  data,
  id_var = getOption("sdc.id_var"),
  val_var,
  by = NULL,
  max_obs = nrow(data),
  fill_id_var = FALSE
)

Arguments

`data`	data.frame from which the descriptive statistics are calculated.
`id_var`	character The name of the id variable. Defaults to `getOption("sdc.id_var")` so that you can provide `options(sdc.id_var = "my_id_var")` at the top of your script.
`val_var`	character vector of value variables on which descriptive statistics are computed.
`by`	character vector of grouping variables.
`max_obs`	integer The maximum number of observations used to calculate the minimum and maximum. Defaults to `nrow(data)`. This is not the number of distinct entities.
`fill_id_var`	logical Only for very specific use cases, for example: (1) `id_var` contains `NA` values that stand for identifiers which actually exist but are unknown (or were deleted for privacy reasons); (2) `id_var` contains `NA` values because an observation features more than one confidential identifier and not all of these identifiers are present in each observation. Examples of such identifiers are the role of a broker in a security transaction or the role of a collateral giver in a credit relationship. If `TRUE`, `NA` values within `id_var` are filled internally with `<filled_[i]>`, assuming that all `NA` values of `id_var` can be treated as different small entities for statistical disclosure control purposes. Set it to `TRUE` only if this assumption is reasonable. Defaults to `FALSE`.

Value

A list of class sdc_min_max with detailed information about options, settings and the calculated extreme values (if possible).
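
The idea of an averaged extreme value can be illustrated in base R. This is a sketch under assumptions, not the package's implementation: the hypothetical helper `averaged_extremes()` simply averages the `n_obs` smallest and the `n_obs` largest observations, and it omits the checks on distinct entities and dominance that decide whether such a result may be released.

```r
# Hypothetical helper (not part of sdcLog): average the n_obs smallest and
# the n_obs largest values instead of reporting a single observation.
averaged_extremes <- function(x, n_obs = 5L) {
  x <- sort(x[!is.na(x)])
  stopifnot(length(x) >= n_obs)
  c(min = mean(head(x, n_obs)), max = mean(tail(x, n_obs)))
}

values <- c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 1000)
res <- averaged_extremes(values)
print(res)
```

The outlier 1000 no longer appears as an individual value in the output, which is the point of averaging over several observations.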

Examples

sdc_min_max(sdc_min_max_DT, id_var = "id", val_var = "val_1")
sdc_min_max(sdc_min_max_DT, id_var = "id", val_var = "val_2")
sdc_min_max(sdc_min_max_DT, id_var = "id", val_var = "val_3", max_obs = 10)
sdc_min_max(sdc_min_max_DT, id_var = "id", val_var = "val_1", by = "year")
sdc_min_max(
  sdc_min_max_DT, id_var = "id", val_var = "val_1", by = c("sector", "year")
)

Example data for `sdc_min_max()`

Description

Utilized in the vignette.

Usage

data("sdc_min_max_DT")

Format

A data.table with 20 rows and 6 columns.

Details

The data.table contains the following columns:

`id`	factor, random identifier
`sector`	factor, economic sector
`year`	integer, time variable
`val_1` - `val_3`	numeric, value variables

Disclosure control for models

Description

Checks whether your model complies with RDC rules. It checks the overall number of distinct entities and the number of distinct entities for each level of dummy variables.

Usage

sdc_model(data, model, id_var = getOption("sdc.id_var"), fill_id_var = FALSE)

Arguments

`data`	data.frame which was used to build the model.
`model`	The estimated model object. Can be a model type like lm, glm and various others (anything which can be handled by `broom::augment()`).
`id_var`	character The name of the id variable. Defaults to `getOption("sdc.id_var")` so that you can provide `options(sdc.id_var = "my_id_var")` at the top of your script.
`fill_id_var`	logical Only for very specific use cases, for example: (1) `id_var` contains `NA` values that stand for identifiers which actually exist but are unknown (or were deleted for privacy reasons); (2) `id_var` contains `NA` values because an observation features more than one confidential identifier and not all of these identifiers are present in each observation. Examples of such identifiers are the role of a broker in a security transaction or the role of a collateral giver in a credit relationship. If `TRUE`, `NA` values within `id_var` are filled internally with `<filled_[i]>`, assuming that all `NA` values of `id_var` can be treated as different small entities for statistical disclosure control purposes. Set it to `TRUE` only if this assumption is reasonable. Defaults to `FALSE`.

Value

A list of class sdc_model with detailed information about options, settings, and compliance with the distinct entities criterion.
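
The distinct entities criterion for dummy variables can be pictured with a base R sketch on toy data. The data frame, its column names, and the way the counts are computed are illustrative assumptions; sdc_model() derives the relevant observations from the model object itself.

```r
# Toy data; ids and dummy levels are made up for this illustration.
toy <- data.frame(
  id = c("A", "A", "B", "C", "D", "E", "F", "F", "G"),
  dummy = c(rep("no", 6), rep("yes", 3))
)

# Overall number of distinct entities.
n_total <- length(unique(toy$id))

# Number of distinct entities within each level of the dummy variable.
n_by_level <- tapply(toy$id, toy$dummy, function(ids) length(unique(ids)))

min_ids <- 5L  # same value as the documented fallback of getOption("sdc.n_ids", 5L)
print(n_total >= min_ids)
print(n_by_level >= min_ids)
```

In this toy case the overall count passes, but the level `"yes"` is backed by only two entities, so a coefficient for that dummy would be flagged.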

Examples

# Check simple models
model_1 <- lm(y ~ x_1 + x_2, data = sdc_model_DT)
sdc_model(data = sdc_model_DT, model = model_1, id_var = "id")

model_2 <- lm(y ~ x_1 + x_2 + x_3, data = sdc_model_DT)
sdc_model(data = sdc_model_DT, model = model_2, id_var = "id")

model_3 <- lm(y ~ x_1 + x_2 + dummy_3, data = sdc_model_DT)
sdc_model(data = sdc_model_DT, model = model_3, id_var = "id")

Example data for `sdc_model()`

Description

Utilized in the vignette.

Usage

data("sdc_model_DT")

Format

A data.table with 80 rows and 9 columns.

Details

The data.table contains the following columns:

`id`	factor, random identifier
`y`, `x_1` - `x_4`	numeric, value variables
`dummy_1` - `dummy_3`	factor, dummy variables
