Data Extraction

Overview

The trashpanda package includes a set of data import utilities designed to:

List files in complex directory trees with filtering.
Safely read multiple file types (CSV, Excel) while handling errors.
Extract tables from arbitrary locations within Excel sheets.

These tools are useful for projects where data may be scattered across folders or formatted inconsistently.

Listing Files

list_data_files() recursively lists files in a directory, with options to filter by:

File extension (ext)
Folder or file patterns to include or exclude
Returning a tibble or a list of paths

library(trashpanda)
library(tibble)

# List all Excel files in nested directories containing "results"
files <- list_data_files(
  path = "path/to/project",
  ext = c("xls", "xlsx", "xlsm"),
  include = c("results")
)

files

Output: A tibble with file, folder, and filename columns, providing a clean inventory of data sources.

Reading Files Safely

safe_read_data() reads a single file while capturing any errors. Returns a list:

data: tibble of the imported file (or NULL on failure)

error: error message if reading failed

file: file path

This is especially useful when reading multiple files, some of which may be malformed.

results <- safe_read_data(
  file = "data/example.xlsx",
  reader = readxl::read_excel,
  sheet = "Sheet1",
  cols = c("id", "value", "date")
)

# Access imported data safely
results$data

Reading a Directory Tree

read_data_tree() combines the previous functions:

Recursively lists all relevant files
Reads each file safely with safe_read_data()
Optionally enforces consistent column names
Combines multiple Excel sheets automatically

all_data <- read_data_tree(
  path = "path/to/project",
  reader = readxl::read_excel,
  ext = c("xlsx", "xls"),
  sheet_pattern = "2026",     # optional regex
  cols = c("id", "value", "date")
)

all_data

Why this is useful:

You can process hundreds of files scattered in different files with a single call
Ensures that column names and types remain consistent
Automatically handles errors and skips unreadable files

Extracting Tables Anywhere in Excel

Some Excel sheets are messy, with tables not starting in the first row. extract_table():

Searches for a table starting at a known header (start_column)
Optionally identifies multiple tables per sheet (table_mode = "all")
Adds metadata columns: file name, sheet, and table ID

tbl <- extract_table(
  path = "data/example.xlsx",
  sheet = "Sheet1",
  start_column = "Sample ID"
)

tbl

This function is particularly useful when:

You don’t know the exact row where the table starts and the tables are in different locations in each sheet or file
The table is somewhat consistent (e.g., you know what the table starts with)
Multiple tables exist in the same sheet
You want to keep file and sheet metadata

Putting It All Together

These functions can be combined for batch processing of Excel data:

# Extract tables from all files, safely
read_data_tree(
  path = here(),
  ext = "xlsx",
  recursive = TRUE,
  reader = extract_table,
  start_column = "Sample ID",
  table_mode = "all",
  safely = TRUE,
  id_cols = FALSE
)

In the above example, the following occurs:

All files in the base (here::here()) and sub directories (recursive = TRUE) with the extension “.xlsx” are listed
Tables from excel files are automatically located and read via the extract_table function passed to reader
If there are multiple tables matching criteria on the same sheet, both are read (table_mode = "all")

An example output looks like the following:

# A tibble: 40 × 6
   sheet_name        `Sample ID` Value Notes Extra  source_file                                                      
   <chr>             <chr>       <chr> <chr> <chr>  <chr>                                                            
 1 clean_simple      A1          10    ok    NA     C:/Projects/trashpanda/test/extract_table_test_sheets - Copy.xlsx
 2 clean_simple      A2          12    ok    NA     C:/Projects/trashpanda/test/extract_table_test_sheets - Copy.xlsx
 3 clean_simple      A3          15    ok    NA     C:/Projects/trashpanda/test/extract_table_test_sheets - Copy.xlsx
 4 offset_table      B1          5     low   NA     C:/Projects/trashpanda/test/extract_table_test_sheets - Copy.xlsx
 5 offset_table      B2          8     mid   NA     C:/Projects/trashpanda/test/extract_table_test_sheets - Copy.xlsx
 6 dynamic_width     C1          1     x     NA     C:/Projects/trashpanda/test/extract_table_test_sheets - Copy.xlsx
 7 dynamic_width     C2          2     y     NA     C:/Projects/trashpanda/test/extract_table_test_sheets - Copy.xlsx
 8 explicit_last_col D1          100   a     ignore C:/Projects/trashpanda/test/extract_table_test_sheets - Copy.xlsx
 9 explicit_last_col D2          200   b     ignore C:/Projects/trashpanda/test/extract_table_test_sheets - Copy.xlsx
10 internal_blanks   E1          1     NA    NA     C:/Projects/trashpanda/test/extract_table_test_sheets - Copy.xlsx
# ℹ 30 more rows