Skip to contents

This function can apply the following common data cleaning tasks:

  • Applies stringr::str_squish and stringr::str_trim to all character columns

  • Optionally replaces all character values of "" with NA values

  • Optionally corrects UNIX formatted dates with 1970-01-01 origins

  • Optionally renames variables by passing a named list of variables

The address functions previously included with format_data() are now documented at format_address_data().

Usage

format_data(
  x,
  var_names = NULL,
  xwalk = NULL,
  clean_names = TRUE,
  .name_repair = "check_unique",
  replace_na_with = NULL,
  replace_with_na = NULL,
  replace_empty_char_with_na = FALSE,
  fix_date = FALSE,
  label = FALSE,
  remove_empty = NULL,
  remove_constant = FALSE,
  format_sf = FALSE,
  ...,
  call = caller_env()
)

rename_with_xwalk(
  x,
  xwalk = NULL,
  label = FALSE,
  .strict = TRUE,
  keep_all = TRUE,
  arg = caller_arg(x),
  call = caller_env()
)

label_with_xwalk(x, xwalk = NULL, label = "var", ...)

make_variable_dictionary(
  x,
  .labels = NULL,
  .definitions = NULL,
  details = c("basic", "none", "full")
)

fix_epoch_date(x, .cols = dplyr::contains("date"), tz = "")

Arguments

x

A tibble or data frame object

var_names

A named list following the format, list("New var name" = old_var_name), or a two column data frame with the first column being the new variable names and the second column being the old variable names; defaults to NULL.

xwalk

A data frame with two columns using the first column as name and the second column as value; or a named list. The existing names of x must be the values and the new names must be the names.

clean_names

If TRUE, set .name_repair to janitor::make_clean_names(); defaults to TRUE.

.name_repair

Defaults to "check_unique"

replace_na_with

A named list to pass to tidyr::replace_na(); defaults to NULL.

replace_with_na

A named list to pass to naniar::replace_with_na(); defaults to NULL.

replace_empty_char_with_na

If TRUE, replace "" with NA using naniar::replace_with_na_if(), Default: TRUE

fix_date

If FALSE, fix UNIX epoch dates (common issue with dates from FeatureServer and MapServer sources) using the fix_epoch_date() function, Default: TRUE

label

For label_with_xwalk() use label = "val" to use labelled::set_value_labels() or "var" (default) to use labelled::set_variable_labels(). For rename_with_xwalk(), if label is TRUE, xwalk is passed to label_with_xwalk() with label = "var" to label columns using the original names. Defaults to FALSE.

remove_empty

If not NULL, pass values ("rows", "cols" or c("rows", "cols") (default)) to the which parameter of janitor::remove_empty()

remove_constant

If TRUE, pass data to janitor::remove_constant() using default parameters.

format_sf

If TRUE, pass x and additional parameters to format_sf_data().

...

Additional parameters passed to format_sf_data()

.strict

If TRUE (default), require that all values from the xwalk are found in the column names of the x data.frame. If FALSE, unmatched values from the xwalk are ignored.

keep_all

If FALSE, columns that are not named in the xwalk are dropped. If TRUE (default), all columns are retained. If x is an sf object, the geometry column will not be dropped even it is not renamed.

arg, call

Additional parameters used internally with cli::cli_abort() to improve error messages.

.labels

Replaces labels column created by labelled::generate_dictionary() if column is all NA (no existing labels assigned); defaults to NULL.

.definitions

Character vector of definitions appended to dictionary data frame. Must be in the same order as the variables in the provided data frame x.

details

add details about each variable (full details could be time consuming for big data frames, FALSE is equivalent to "none" and TRUE to "full")

.cols

tidyselect for columns to apply epoch date fixing function to. Defaults to dplyr::contains("date").

tz

Time zone passed to as.POSIXct().

Value

The input data frame or simple feature object with formatting functions applied.

Examples

nc <- get_location_data(data = system.file("shape/nc.shp", package = "sf"))

format_data(nc)
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
#> # A tibble: 100 × 15
#>     area perimeter  cnty cnty_id name  fips  fipsno cress_id bir74 sid74 nwbir74
#>  * <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int> <dbl> <dbl>   <dbl>
#>  1 0.114      1.44  1825    1825 Ashe  37009  37009        5  1091     1      10
#>  2 0.061      1.23  1827    1827 Alle… 37005  37005        3   487     0      10
#>  3 0.143      1.63  1828    1828 Surry 37171  37171       86  3188     5     208
#>  4 0.07       2.97  1831    1831 Curr… 37053  37053       27   508     1     123
#>  5 0.153      2.21  1832    1832 Nort… 37131  37131       66  1421     9    1066
#>  6 0.097      1.67  1833    1833 Hert… 37091  37091       46  1452     7     954
#>  7 0.062      1.55  1834    1834 Camd… 37029  37029       15   286     0     115
#>  8 0.091      1.28  1835    1835 Gates 37073  37073       37   420     0     254
#>  9 0.118      1.42  1836    1836 Warr… 37185  37185       93   968     4     748
#> 10 0.124      1.43  1837    1837 Stok… 37169  37169       85  1612     1     160
#> # ℹ 90 more rows
#> # ℹ 4 more variables: bir79 <dbl>, sid79 <dbl>, nwbir79 <dbl>,
#> #   geometry <MULTIPOLYGON [°]>