Accessing open data with R

We are going to explore the Maryland Open Data Portal and Maryland iMap including:

Accessing Maryland’s Open Data Portal

Downloading data from Maryland’s Open Data Portal
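
A minimal sketch of the download step, assuming the portal follows the usual Socrata CSV export URL pattern. The dataset ID `xxxx-xxxx` is a placeholder rather than a real identifier, and `portal_csv_url()` is a helper name made up for this example.

```r
# Build a CSV export URL for a dataset on a Socrata-hosted portal.
# `dataset_id` is the "four-by-four" ID shown in a dataset's URL.
portal_csv_url <- function(dataset_id, domain = "opendata.maryland.gov") {
  paste0(
    "https://", domain, "/api/views/", dataset_id,
    "/rows.csv?accessType=DOWNLOAD"
  )
}

# Placeholder ID: swap in the ID of the dataset you want.
url <- portal_csv_url("xxxx-xxxx")

# Reading the file needs network access and a real ID, so this part
# only runs in an interactive session.
if (interactive()) {
  catalog <- readr::read_csv(url)
  dplyr::glimpse(catalog)
}
```

`dplyr::glimpse()` prints one line per column, which is a convenient way to check column names and types for a wide table.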

Rows: 2,068
Columns: 18
$ `Unique Identifier`                                           <chr> "fac887d…
$ `Dataset Name`                                                <chr> "MD Cong…
$ Link                                                          <chr> "https:/…
$ `Agency Performing Data Updates`                              <chr> "NULL", …
$ Owner                                                         <chr> "mdimapd…
$ `Data Provided By`                                            <chr> "NULL", …
$ `Source URL`                                                  <chr> "https:/…
$ `Update Frequency`                                            <chr> "NULL", …
$ `Date of Most Recent Data Change`                             <chr> "01/01/1…
$ `Days Since Most Recent Data Change`                          <dbl> -9999, 2…
$ `Date of Most Recent Change (Data Change or Metadata Change)` <chr> "01/01/1…
$ `Updated Recently Enough?`                                    <chr> "NULL", …
$ `Number of Rows`                                              <dbl> -9999, 1…
$ `Tags / Keywords`                                             <chr> "COVID-1…
$ `Column Names`                                                <chr> "NULL", …
$ `Missing Metadata Fields`                                     <chr> "NULL", …
$ Portal                                                        <chr> "Data Ca…
$ Category                                                      <chr> "Coronav…

Using the Socrata API and the {httr2} package

Create a request with httr2::request()

The first step in working with an API is to create a request:
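
A minimal sketch of building a request. The endpoint (the portal's `data.json` catalog file) and the user agent string are assumptions chosen for illustration.

```r
library(httr2)

# Assumed endpoint: the portal's catalog file in JSON format.
req <- request("https://opendata.maryland.gov/data.json")

# Identify yourself to the server (the string is only an example).
req <- req_user_agent(req, "open-data-with-r workshop example")

# Printing a request shows its method and URL; nothing has been sent yet.
req
```

Creating the request object does not contact the server, so it is safe to build and inspect a request before sending it.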

Extracting the body from the response with httr2::resp_body_json()

Now, we need to perform the request and extract the body of the response:
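
A sketch of this step, assuming the same `data.json` endpoint. The helper name `fetch_json()` is hypothetical.

```r
library(httr2)

# Send a request and parse the JSON body of the response.
# simplifyVector = TRUE asks jsonlite to turn arrays of records
# into vectors and data frames where it can.
fetch_json <- function(req) {
  resp <- req_perform(req)
  resp_body_json(resp, simplifyVector = TRUE)
}

# Assumed endpoint for the portal's catalog.
req <- request("https://opendata.maryland.gov/data.json")

# Performing the request needs network access, so this part only
# runs in an interactive session.
if (interactive()) {
  catalog <- fetch_json(req)
}
```

Without `simplifyVector = TRUE`, `resp_body_json()` returns nested lists instead of data frames, which is usually harder to explore.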

Explore and subset the response object

Lastly, pull the data you need out of the response:
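
The exploration pattern can be sketched with base R. The `catalog` object here is a small stand-in with made-up values, not the real response; the same calls apply to a parsed response that has a `dataset` data frame.

```r
# A small stand-in for the parsed response: a named list whose
# `dataset` element is a data frame (all values are made up).
catalog <- list(
  `@type` = "dcat:Catalog",
  dataset = data.frame(
    title = c("Dataset A", "Dataset B", "Dataset C"),
    issued = c("2016-07-22", "2021-08-31", "2022-09-22"),
    accessLevel = "public"
  )
)

# Which top-level elements does the response have?
names(catalog)

# Compact overview of the nested structure, limited to two levels.
str(catalog, max.level = 2)

# Pull out the data frame and keep the rows issued in 2021 or later.
datasets <- catalog[["dataset"]]
recent <- datasets[as.Date(datasets$issued) >= as.Date("2021-01-01"), ]
recent$title
```
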

[1] "@context"    "@id"         "@type"       "conformsTo"  "describedBy"
[6] "dataset"    
List of 6
 $ @context   : chr ""
 $ @id        : chr ""
 $ @type      : chr "dcat:Catalog"
 $ conformsTo : chr ""
 $ describedBy: chr ""
 $ dataset    :'data.frame':    1421 obs. of  14 variables:
  ..$ accessLevel : chr [1:1421] "public" "public" "public" "public" ...
  ..$ landingPage : chr [1:1421] "" "" "" "" ...
  ..$ issued      : chr [1:1421] "2016-07-22" "2021-08-31" "2016-07-22" "2022-09-22" ...
  ..$ @type       : chr [1:1421] "dcat:Dataset" "dcat:Dataset" "dcat:Dataset" "dcat:Dataset" ...
  ..$ modified    : chr [1:1421] "2020-01-25" "2021-09-01" "2020-01-25" "2022-09-22" ...
  ..$ license     : chr [1:1421] NA NA NA NA ...
Rows: 1,421
Columns: 14
$ accessLevel  <chr> "public", "public", "public", "public", "public", "public…
$ landingPage  <chr> "", "https://ope…
$ issued       <chr> "2016-07-22", "2021-08-31", "2016-07-22", "2022-09-22", "…
$ `@type`      <chr> "dcat:Dataset", "dcat:Dataset", "dcat:Dataset", "dcat:Dat…
$ modified     <chr> "2020-01-25", "2021-09-01", "2020-01-25", "2022-09-22", "…
$ keyword      <list> <"active pound net sites", "adult habitat", "artificial …
$ contactPoint <df[,3]> <data.frame[26 x 3]>
$ publisher    <df[,2]> <data.frame[26 x 2]>
$ identifier   <chr> "", "…
$ description  <chr> "This is a MD iMAP hosted service. Find more informati…
$ title        <chr> "MD iMAP: Maryland Finfish - Maryland Artificial Reef Ini…
$ distribution <list> [<data.frame[1 x 3]>], <NULL>, [<data.frame[1 x 3]>], <NU…
$ theme        <list> "Biota", <NULL>, "Boundaries", <NULL>, "Society", "Energy…
$ license      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Using a package like httr2 is essential if you are working with an API directly, but there are often packages developed around a specific API or portal provider.

Using the RSocrata package

For example, RSocrata was developed by the City of Chicago for working with Socrata data portals:
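
A minimal sketch of reading from the portal with RSocrata. The dataset ID `xxxx-xxxx` is a placeholder, and `socrata_resource_url()` is a hypothetical helper written for this example.

```r
# Build the API endpoint URL for one dataset on a Socrata portal.
socrata_resource_url <- function(dataset_id, domain = "opendata.maryland.gov") {
  paste0("https://", domain, "/resource/", dataset_id, ".json")
}

# Placeholder ID: swap in the ID of the dataset you want.
url <- socrata_resource_url("xxxx-xxxx")

# Both calls need network access (and a real ID for read.socrata()),
# so this part only runs in an interactive session.
if (interactive()) {
  # List the datasets published on the portal.
  datasets <- RSocrata::ls.socrata("https://opendata.maryland.gov")

  # Read one dataset into a data frame.
  catalog <- RSocrata::read.socrata(url)
  dplyr::glimpse(catalog)
}
```
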

Rows: 2,068
Columns: 18
$ unique_identifier                                <chr> "5de7448e708b46508718…
$ dataset_name                                     <chr> "Elected_Officials", …
$ link                                             <chr> "…
$ state_agency_performing_data_updates             <chr> "NULL", "NULL", "MDP"…
$ owner                                            <chr> "mdimapdatacatalog", …
$ data_provided_by                                 <chr> "NULL", "NULL", "MDP"…
$ source_url                                       <chr> "https://services.arc…
$ update_frequency                                 <chr> "NULL", "Daily", "As …
$ date_of_most_recent_data_change                  <dttm> 1970-01-01, 1970-01-…
$ days_since_last_data_update                      <dbl> -9999, -9999, 3075, 2…
$ date_of_most_recent_view_change_data_or_metadata <dttm> 1970-01-01, 1970-01-…
$ updated_recently_enough                          <chr> "Metadata on update f…
$ number_of_rows                                   <int> -9999, -9999, 1814, 9…
$ tags_keywords                                    <chr> "elected, officials, …
$ update_status                                    <chr> "NULL", "NULL", "OBJE…
$ missing_metadata_fields                          <chr> "NULL", "NULL", "NULL…
$ portal                                           <chr> "Data Catalog", "Data…
$ category                                         <chr> "NULL", "NULL", "Agri…

Interlude: Working with “powerful numbers”

Bouk, Ackermann, and boyd (2022) offer us a primer on thinking about “powerful numbers” and how they work in the world.

Aside: Exploring the “avalanche of printed numbers”

Aside: Using the “avalanche of printed numbers”

Aside: Extracting tables from images and PDFs

  • {tesseract} is an R package with bindings for the Tesseract-OCR (optical character recognition) engine.
  • Tabula is a tool with an interactive user interface for liberating data tables locked inside (text-based) PDF files.
  • {tabulizer} is an R package for interacting with the Tabula Java library.
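
A short sketch of how these packages might be called. Both file paths are hypothetical, and each step is skipped if the file does not exist.

```r
# Hypothetical input files; replace with your own.
pdf_path <- "report.pdf"
image_path <- "scanned-table.png"

# Text-based PDF: extract_tables() returns a list with one element
# per table it detects.
if (file.exists(pdf_path)) {
  tables <- tabulizer::extract_tables(pdf_path)
  length(tables)
}

# Scanned image: ocr() returns the recognized text as a character
# vector, which still needs to be split into rows and columns.
if (file.exists(image_path)) {
  text <- tesseract::ocr(image_path)
  cat(text)
}
```

Tabula works on PDFs that contain selectable text; a scanned page needs OCR first.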

Exploring Maryland iMap

Working with real property data from the Maryland Department of Assessments and Taxation



Here is a recap of the interlude sections:

  • Using Tabula and the {tabulizer} R package
  • Using {officer} to access Word documents
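
As a reminder of the {officer} workflow, here is a minimal sketch. The file path is hypothetical, and the code is skipped if the file does not exist.

```r
# Hypothetical Word document; replace with your own file.
docx_path <- "report.docx"

if (file.exists(docx_path)) {
  doc <- officer::read_docx(docx_path)

  # docx_summary() returns a data frame with one row per paragraph
  # or table cell in the document.
  content <- officer::docx_summary(doc)

  # Keep only the rows that come from tables.
  table_cells <- content[content$content_type == "table cell", ]
  head(table_cells)
}
```
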
