Read a single file from an archive #271

Open
PietrH opened this issue Nov 19, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@PietrH
Member

PietrH commented Nov 19, 2024

Bart sent us a message with an example in which he was able to read a single events.csv from a 10 GB archive very quickly.

Hi Pieter & Peter, I thought this might interest you. I tried reading partial files a bit more, using a camera-trap Zenodo repository by Julian as an example. There I can read the events.csv from a 10 GB archive within a second. That is quite useful for applications where you are only interested in a subset of the data (say, all tiger images for camera traps, or only summer radar data).

system.time(a <- vroom::vroom(
  archive::archive_read(
    "https://zenodo.org/records/10671148/files/pilot1.zip?download=1",
    file = "pilot1/events.csv"
  )
))
#> Rows: 30506 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (3): eventID, deploymentID, mediaID
#> dttm (2): eventStart, eventEnd
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#>    user  system elapsed
#>   0.412   0.046   0.984
tibble::glimpse(a)
#> Rows: 30,506
#> Columns: 5
#> $ eventID      <chr> "42d09be5-1b91-49e1-a154-864eb557c0a4", "42d09be5-1b91-49…
#> $ deploymentID <chr> "AWD_2_13082021_pilot bc34dfce-8ee3-4e97-870e-d53079b80ce…
#> $ eventStart   <dttm> 2022-08-20 04:27:38, 2022-08-20 04:27:38, 2022-08-20 04:…
#> $ eventEnd     <dttm> 2022-08-20 04:27:44, 2022-08-20 04:27:44, 2022-08-20 04:…
#> $ mediaID      <chr> "d919bdd2-35e0-4219-b74d-45f2201d5ba1", "77522548-6728-45…

However, two other files took longer:

Maybe I have been a bit premature, as how quick the read is seems to depend on the file's position in the archive. Two other CSVs (media and observations) that are about equally sized take much longer, while locally they are about as quick. Presumably the stream has to be read sequentially up to the target file.

system.time(a <- vroom::vroom(show_col_types = FALSE,
  archive::archive_read(
    "https://zenodo.org/records/10671148/files/pilot2.zip?download=1",
    file = "pilot2/events.csv"
  )
))
#>    user  system elapsed 
#>   0.334   0.017   0.712
system.time(a <- vroom::vroom(show_col_types = FALSE,
  archive::archive_read(
    "https://zenodo.org/records/10671148/files/pilot2.zip?download=1",
    file = "pilot2/observations.csv"
  )
))
#>    user  system elapsed 
#>   1.416   1.319  30.998
system.time(a <- vroom::vroom(show_col_types = FALSE,
  archive::archive_read(
    "https://zenodo.org/records/10671148/files/pilot2.zip?download=1",
    file = "pilot2/media.csv"
  )
))
#>    user  system elapsed 
#>   1.458   1.406  32.188


system.time(a <- vroom::vroom(show_col_types = FALSE,
  archive::archive_read(
    "~/Downloads/pilot2.zip",
    file = "pilot2/events.csv"
  )
))
#>    user  system elapsed 
#>   0.012   0.002   0.014
system.time(a <- vroom::vroom(show_col_types = FALSE,
  archive::archive_read(
    "~/Downloads/pilot2.zip",
    file = "pilot2/observations.csv"
  )
))
#>    user  system elapsed 
#>   0.044   0.013   0.058
system.time(a <- vroom::vroom(show_col_types = FALSE,
  archive::archive_read(
    "~/Downloads/pilot2.zip",
    file = "pilot2/media.csv"
  )
))
#>    user  system elapsed 
#>   0.036   0.017   0.052

In Python he had better luck:

import unzip_http  # pip install unzip-http
import pandas
# RemoteZipFile issues HTTP Range requests instead of downloading the archive.
rzf = unzip_http.RemoteZipFile("https://zenodo.org/records/10671148/files/pilot2.zip")
rzf.namelist()  # member names, read from the central directory
binfp = rzf.open('pilot2/observations.csv')  # stream only this member
print(binfp.readlines())
@PietrH PietrH added the enhancement New feature or request label Nov 19, 2024
@bart1

bart1 commented Nov 19, 2024

From my research I find that zip files have a central directory at the end of the file that can be read to find where the individual members are located. Remote files on Zenodo can be read partially using the Range header (see https://rhardih.io/2021/04/listing-the-contents-of-a-remote-zip-archive-without-downloading-the-entire-file/). I can also do this with httr2. The R package archive is based on libarchive, which is designed as a streaming library and thus does not support this kind of random access. I have not been able to find an R library that supports random access into remote zip files, but the Python example shows it can be done.
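The lookup described above can be sketched in Python with the standard library only (the helper name and sample file names are mine): the End of Central Directory (EOCD) record is the last 22 bytes of an archive without a comment, and its bytes 12–15 and 16–19 hold the central directory's size and offset.

```python
import io
import struct
import zipfile

def eocd_offsets(data: bytes):
    """Parse the End of Central Directory record (assumes no archive
    comment and no zip64) and return (cd_offset, cd_size)."""
    eocd = data[-22:]                 # fixed-size EOCD record
    assert eocd[:4] == b"PK\x05\x06"  # EOCD signature 0x06054b50
    cd_size, cd_offset = struct.unpack("<II", eocd[12:20])
    return cd_offset, cd_size

# Build a small archive in memory to stand in for the remote file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("pilot/events.csv", "eventID\n1\n")
    zf.writestr("pilot/media.csv", "mediaID\n2\n")

offset, size = eocd_offsets(buf.getvalue())
# Over HTTP the same two fields would drive two Range requests:
# first "bytes=-22", then f"bytes={offset}-{offset + size - 1}".
print(offset, size)
```

Over HTTP, only those two small ranges need to be fetched to list the archive; this mirrors what the R code below does with httr2.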

@bart1

bart1 commented Nov 19, 2024

Here is another nice blog post about the process: https://www.djmannion.net/partial_zip/index.html

@bart1

bart1 commented Nov 19, 2024

I have played around a bit, and parsing a remote zip file is not so difficult: the following code reads data from a remote zip over HTTP. There is still plenty of room for improvement, but files can already be read within seconds.

require(httr2)
#> Loading required package: httr2
system.time({
  # Fetch the End of Central Directory (EOCD) record, then the central
  # directory itself; assumes no trailing archive comment and no zip64.
  get_cd <- function(x = "https://zenodo.org/records/10671148/files/pilot2.zip") {
    end <- request(x) |>
      req_headers(Range = "bytes=-22") |> # the EOCD record is the last 22 bytes
      req_perform() |>
      purrr::chuck("body")
    cd_start <- end[17:20] |> # EOCD bytes 17-20: central directory offset
      rawToBits() |>
      packBits("integer")
    cd_len <- end[13:16] |> # EOCD bytes 13-16: central directory size
      rawToBits() |>
      packBits("integer")
    header <- request(x) |>
      req_headers(Range = glue::glue("bytes={cd_start}-{cd_start+cd_len+22-1}")) |>
      req_perform() |>
      purrr::chuck("body")
    return(header)
  }
  # Read a 2-byte little-endian unsigned integer by zero-padding to 4 bytes.
  raw2ToInt <- function(x) {
    c(x, as.raw(0x00), as.raw(0x00)) |>
      rawToBits() |>
      packBits("integer")
  }

  dd <- get_cd()

  # Walk the central directory: each entry starts with signature PK\x01\x02
  # followed by the fixed-size fields below (field sizes in bytes).
  parsecd <- function(x) {
    deparse <- c(
      signature = 4, version_made_by = 2, version_need_to_extract = 2, bit_flag = 2,
      compression_method = 2, last_mod_time = 2, last_mod_date = 2, crc32 = 4, compressed_size = 4, uncompressed_size = 4,
      filename_length = 2, extra_field_length = 2, file_comment_length = 2, disk_num = 2, int_file_attr = 2, ext_file_attr = 4, rel_offset = 4
    )
    deparse <- unlist(purrr::map2(names(deparse), deparse, ~ rep(.x, each = .y)))

    res <- list()
    while (all(head(x, 4) == as.raw(c(0x50, 0x4b, 0x01, 0x02)))) {
      l <- split(head(x, length(deparse)), deparse)
      x <- tail(x, -length(deparse))
      filename_length_int <- raw2ToInt(l$filename_length)
      extra_field_length_int <- raw2ToInt(l$extra_field_length)
      file_comment_length_int <- raw2ToInt(l$file_comment_length)
      l[["filename"]] <- head(x, filename_length_int)
      x <- tail(x, -filename_length_int)
      l[["extra_field"]] <- head(x, extra_field_length_int)
      if (extra_field_length_int != 0) {
        x <- tail(x, -extra_field_length_int)
      }
      l[["file_comment"]] <- head(x, file_comment_length_int)
      if (file_comment_length_int != 0) {
        x <- tail(x, -file_comment_length_int)
      }
      # Collect each entry as a one-row data frame (fields kept as raw vectors).
      res <- c(res, list(structure(lapply(l, list), row.names = c(
        NA,
        -1L
      ), class = "data.frame")))
    }
    rr <- dplyr::bind_rows(res) |>
      dplyr::mutate(
        filename = purrr::map_chr(filename, rawToChar),
        rel_offset = purrr::map_int(rel_offset, ~ packBits(rawToBits(.x), "integer")),
        compressed_size = purrr::map_int(compressed_size, ~ packBits(rawToBits(.x), "integer")),
        uncompressed_size = purrr::map_int(uncompressed_size, ~ packBits(rawToBits(.x), "integer"))
      )
    rr
  }
  rr <- parsecd(dd)
  rr |>
    dplyr::mutate(next_rel_offset = dplyr::lead(rel_offset)) |>
    dplyr::filter(grepl(pat = "media.csv", filename)) -> file


  depcsv <- request("https://zenodo.org/records/10671148/files/pilot2.zip") |>
    req_headers(Range = glue::glue("bytes={file$rel_offset}-{file$next_rel_offset-1}")) |>
    req_perform() |>
    purrr::chuck("body")

  # Fixed-size fields of the local file header that precede the member data.
  deparself <- c(
    signature = 4, version_need_to_extract = 2, bit_flag = 2,
    compression_method = 2, last_mod_time = 2, last_mod_date = 2,
    crc32 = 4, compressed_size = 4, uncompressed_size = 4,
    filename_length = 2, extra_field_length = 2
  )
  l <- list()
  for (i in names(deparself)) {
    l[[i]] <- head(depcsv, deparself[i])
    depcsv <- tail(depcsv, -deparself[i])
  }
  filename_length_int <- raw2ToInt(l$filename_length)
  extra_field_length_int <- raw2ToInt(l$extra_field_length)

  l[["filename"]] <- head(depcsv, filename_length_int)
  depcsv <- tail(depcsv, -filename_length_int)
  l[["extra_field"]] <- head(depcsv, extra_field_length_int)
  if (extra_field_length_int != 0) {
    depcsv <- tail(depcsv, -extra_field_length_int)
  }
  # Zip members are stored as raw deflate; prepend a minimal zlib header
  # (0x78 0x01) so zip::inflate() accepts the stream.
  c(as.raw(0x78), as.raw(0x01), depcsv) |>
    zip::inflate() |>
    purrr::chuck("output") -> rawInflated
  a <- vroom::vroom(rawConnection(rawInflated))
})
#> Rows: 365 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (6): mediaID, deploymentID, captureMethod, filePath, fileName, fileMedi...
#> lgl  (4): filePublic, exifData, favorite, mediaComments
#> dttm (1): timestamp
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#>    user  system elapsed 
#>   2.697   0.053   3.217
dplyr::glimpse(a)
#> Rows: 365
#> Columns: 11
#> $ mediaID       <chr> "10b0e4da-ca2d-4026-8574-bff8d15a3dcb", "5974ba99-73ed-4…
#> $ deploymentID  <chr> "AWD_1_13082021_pilot 46576a8c-019a-4dd8-852e-86380e0973…
#> $ captureMethod <chr> "activityDetection", "activityDetection", "activityDetec…
#> $ timestamp     <dttm> 2021-08-14 00:35:58, 2021-08-14 00:35:59, 2021-08-14 00…
#> $ filePath      <chr> "media\\AWD_1_13082021_pilot 46576a8c-019a-4dd8-852e-863…
#> $ filePublic    <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ fileName      <chr> "10b0e4da-ca2d-4026-8574-bff8d15a3dcb.JPG", "5974ba99-73…
#> $ fileMediatype <chr> "image/jpeg", "image/jpeg", "image/jpeg", "image/jpeg", …
#> $ exifData      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ favorite      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ mediaComments <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
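The 0x78 0x01 step in the code above works because zip members are raw deflate streams without a zlib header. In Python the same decompression can be sketched with zlib, either by prepending a minimal header as above or by passing a negative window size (the payload here is synthetic):

```python
import zlib

payload = b"mediaID,timestamp\n1,2021-08-14\n"

# Zip members are stored as *raw* deflate (no zlib header or checksum).
raw = zlib.compressobj(wbits=-15)
stream = raw.compress(payload) + raw.flush()

# Option 1: the trick used above - prepend a minimal zlib header (0x78 0x01).
via_header = zlib.decompressobj().decompress(b"\x78\x01" + stream)

# Option 2: tell zlib the stream is headerless (wbits=-15).
via_wbits = zlib.decompressobj(wbits=-15).decompress(stream)

assert via_header == via_wbits == payload
```

Option 2 is the cleaner route when the library exposes a raw-deflate mode; the header trick is handy when it does not.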

Note that this only supports classic zip files, not the larger zip64 format.
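The zip64 limitation can at least be detected cheaply: zip64 archives place a 20-byte "zip64 EOCD locator" (signature PK\x06\x07) immediately before the classic 22-byte EOCD record. A heuristic sketch (the helper name is mine; it assumes the archive has no trailing comment):

```python
import io
import zipfile

def looks_zip64(tail: bytes) -> bool:
    """Heuristic zip64 check on the last bytes of an archive (no comment
    assumed): a zip64 EOCD locator sits 20 bytes before the 22-byte EOCD."""
    if len(tail) < 42:
        return False
    return tail[-42:-38] == b"PK\x06\x07"

# A small archive never needs zip64.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("events.csv", "eventID\n1\n")
print(looks_zip64(buf.getvalue()))
```

A reader could use this to fail fast with a clear error instead of misparsing the 0xFFFFFFFF placeholder fields that zip64 puts in the classic EOCD record.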
