
Format data frames and simple features using common approaches
Source: R/format_data.R
This function can apply the following common data cleaning tasks:
Applies stringr::str_squish and stringr::str_trim to all character columns
Optionally replaces all character values of "" with NA values
Optionally corrects UNIX formatted dates with 1970-01-01 origins
Optionally renames variables by passing a named list of variables
The address functions previously included with format_data() are now
documented at format_address_data().
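The cleaning tasks listed above can be approximated in base R. The following is a minimal sketch, not the package's implementation: the data frame, its column names, and the `squish()` helper are hypothetical, and it assumes the epoch values are stored as milliseconds since 1970-01-01 UTC, which the documentation above does not state.

```r
# Illustrative data; the column names are hypothetical.
df <- data.frame(
  name = c("  Ashe   County ", "", "Surry"),
  edit_date = c(1577836800000, 1580515200000, NA),
  stringsAsFactors = FALSE
)

# Base R stand-in for stringr::str_squish(): trim both ends, then collapse
# runs of internal whitespace to a single space.
squish <- function(x) gsub("\\s+", " ", trimws(x))

chr_cols <- vapply(df, is.character, logical(1))
df[chr_cols] <- lapply(df[chr_cols], squish)

# Replace empty strings with NA in character columns.
df[chr_cols] <- lapply(
  df[chr_cols],
  function(x) replace(x, !is.na(x) & x == "", NA_character_)
)

# Assumption: values are milliseconds since the 1970-01-01 UTC origin,
# so divide by 1000 to get seconds before converting.
df$edit_date <- as.POSIXct(df$edit_date / 1000, origin = "1970-01-01", tz = "UTC")

df
```

If the source stores seconds rather than milliseconds, the division by 1000 should be dropped.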
Usage
format_data(
  x,
  var_names = NULL,
  xwalk = NULL,
  clean_names = TRUE,
  .name_repair = "check_unique",
  replace_na_with = NULL,
  replace_with_na = NULL,
  replace_empty_char_with_na = FALSE,
  fix_date = FALSE,
  label = FALSE,
  remove_empty = NULL,
  remove_constant = FALSE,
  format_sf = FALSE,
  ...,
  call = caller_env()
)
rename_with_xwalk(
  x,
  xwalk = NULL,
  cols = c("label", "name"),
  label = FALSE,
  .strict = TRUE,
  keep_all = TRUE,
  arg = caller_arg(x),
  call = caller_env()
)
label_with_xwalk(x, xwalk = NULL, label = "var", ...)
make_variable_dictionary(
  x,
  .labels = NULL,
  .definitions = NULL,
  details = c("basic", "none", "full")
)
fix_epoch_date(x, .cols = dplyr::contains("date"), tz = "")
Arguments
- x
A tibble or data frame object
- var_names
A named list following the format list("New var name" = old_var_name), or a two-column data frame with the first column being the new variable names and the second column being the old variable names; defaults to NULL.
- xwalk
A data frame with two columns using the first column as name and the second column as value; or a named list. The existing names of x must be the values and the new names must be the names.
- clean_names
If TRUE, set .name_repair to janitor::make_clean_names(); defaults to TRUE.
- .name_repair
Defaults to "check_unique"
- replace_na_with
A named list to pass to tidyr::replace_na(); defaults to NULL.
- replace_with_na
A named list to pass to naniar::replace_with_na(); defaults to NULL.
- replace_empty_char_with_na
If TRUE, replace "" with NA using naniar::replace_with_na_if(); defaults to FALSE.
- fix_date
If TRUE, fix UNIX epoch dates (a common issue with dates from FeatureServer and MapServer sources) using the fix_epoch_date() function; defaults to FALSE.
- label
For label_with_xwalk(), use label = "val" to use labelled::set_value_labels() or "var" (default) to use labelled::set_variable_labels(). For rename_with_xwalk(), if label is TRUE, xwalk is passed to label_with_xwalk() with label = "var" to label columns using the original names. Defaults to FALSE.
- remove_empty
If not NULL, pass values ("rows", "cols", or c("rows", "cols"), the janitor default) to the which parameter of janitor::remove_empty(); defaults to NULL.
- remove_constant
If TRUE, pass data to janitor::remove_constant() using default parameters.
- format_sf
If TRUE, pass x and additional parameters to format_sf_data().
- ...
Additional parameters passed to format_sf_data().
- cols
Column names to use for crosswalk.
- .strict
If TRUE (default), require that all values from the xwalk are found in the column names of the x data.frame. If FALSE, unmatched values from the xwalk are ignored.
- keep_all
If FALSE, columns that are not named in the xwalk are dropped. If TRUE (default), all columns are retained. If x is an sf object, the geometry column will not be dropped even if it is not renamed.
- arg, call
Additional parameters used internally with cli::cli_abort() to improve error messages.
- .labels
Replaces labels column created by labelled::generate_dictionary() if column is all NA (no existing labels assigned); defaults to NULL.
- .definitions
Character vector of definitions appended to dictionary data frame. Must be in the same order as the variables in the provided data frame x.
- details
Add details about each variable (full details could be time consuming for big data frames; FALSE is equivalent to "none" and TRUE to "full").
- .cols
tidyselect for columns to apply epoch date fixing function to. Defaults to dplyr::contains("date").
- tz
Time zone passed to as.POSIXct().
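The xwalk, .strict, and keep_all arguments described above can be illustrated with a small base R sketch. The helper `rename_from_xwalk()` below is hypothetical and is not the package's rename_with_xwalk(); it only mirrors the documented behavior for a named list whose names are the new column names and whose values are the existing ones. The sample data is also made up.

```r
# Illustrative data; not taken from the package.
df <- data.frame(
  CNTY_ID = c(1825, 1827),
  NAME = c("Ashe", "Alleghany"),
  AREA = c(0.114, 0.061)
)

# Named list: names are the new column names, values are the existing names.
xwalk <- list(county_id = "CNTY_ID", county_name = "NAME")

# Hypothetical helper sketching the documented semantics.
rename_from_xwalk <- function(x, xwalk, strict = TRUE, keep_all = TRUE) {
  old <- unlist(xwalk, use.names = FALSE)
  new <- names(xwalk)
  found <- old %in% names(x)

  # With strict = TRUE, every existing name in the xwalk must be a column of x.
  if (strict && !all(found)) {
    stop("Columns not found in `x`: ", paste(old[!found], collapse = ", "))
  }

  # Otherwise unmatched entries are ignored.
  old <- old[found]
  new <- new[found]
  names(x)[match(old, names(x))] <- new

  # With keep_all = FALSE, keep only the renamed columns.
  if (!keep_all) {
    x <- x[, new, drop = FALSE]
  }
  x
}

renamed <- rename_from_xwalk(df, xwalk)
renamed_only <- rename_from_xwalk(df, xwalk, keep_all = FALSE)
names(renamed)
names(renamed_only)
```

This sketch does not handle sf objects; the documented behavior of retaining the geometry column would need an extra step.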
Examples
nc <- get_location_data(data = system.file("shape/nc.shp", package = "sf"))
format_data(nc)
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -9386880 ymin: 4012991 xmax: -8399788 ymax: 4382079
#> Projected CRS: WGS 84 / Pseudo-Mercator
#> # A tibble: 100 × 15
#> area perimeter cnty cnty_id name fips fipsno cress_id bir74 sid74 nwbir74
#> * <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.114 1.44 1825 1825 Ashe 37009 37009 5 1091 1 10
#> 2 0.061 1.23 1827 1827 Alle… 37005 37005 3 487 0 10
#> 3 0.143 1.63 1828 1828 Surry 37171 37171 86 3188 5 208
#> 4 0.07 2.97 1831 1831 Curr… 37053 37053 27 508 1 123
#> 5 0.153 2.21 1832 1832 Nort… 37131 37131 66 1421 9 1066
#> 6 0.097 1.67 1833 1833 Hert… 37091 37091 46 1452 7 954
#> 7 0.062 1.55 1834 1834 Camd… 37029 37029 15 286 0 115
#> 8 0.091 1.28 1835 1835 Gates 37073 37073 37 420 0 254
#> 9 0.118 1.42 1836 1836 Warr… 37185 37185 93 968 4 748
#> 10 0.124 1.43 1837 1837 Stok… 37169 37169 85 1612 1 160
#> # ℹ 90 more rows
#> # ℹ 4 more variables: bir79 <dbl>, sid79 <dbl>, nwbir79 <dbl>,
#> # geometry <MULTIPOLYGON [m]>