Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RSDK-9195] ota async task #350

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

mattjperez
Copy link
Member

@mattjperez mattjperez commented Nov 14, 2024

The following PR:

  • RSDK-9195: moves the ota logic to an async task attached to viam-server (this will enable retries/incremental-download/etc)
  • RSDK-9198: uses async hyper::http2 instead of the blocking EspHttpClient

The errors will get handled better as part of RSDK-9214

@mattjperez mattjperez self-assigned this Nov 14, 2024
@mattjperez mattjperez requested a review from a team as a code owner November 14, 2024 17:18
Comment on lines 89 to 90
bincode = "2.0.0-rc.3"

Copy link
Member Author

@mattjperez mattjperez Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make this an optional dependency for ota feature moving forward. That way it doesn't get into the regular build

will rm ws

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to just being compiled when target isn't esp32 for now. the ticket for having a native representation for the app description might use a different crate eventually so defering until then.

@@ -622,6 +600,42 @@ where
}
}

#[cfg(feature = "ota")]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved the whole ota section to after certs have been retrieved and stored

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need certs? I thought the certs were for us serving

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I am using the same certs for the hyper client's https connection. should I not be?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try the empirical answer: does it still work if you don't use them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you were right. we just need the certificate authority, which the H2 connector is already using. removed the certificates completely.

Comment on lines 83 to 101
struct EspAppDesc {
//TODO verify this is doing what I think it's doing, validate results
/// ESP_APP_DESC_MAGIC_WORD (0xABCD5432)
magic_word: u32,
secure_version: u32,
reserv1: [u32; 2],
/// application version
version: [u8; 32],
project_name: [u8; 32],
/// compile time
time: [u8; 16],
/// compile date
date: [u8; 16],
idf_ver: [u8; 32],
app_elf_sha256: [u8; 32],
/// minimal eFuse block revision supported by image, in format: major * 100 + minor
min_efuse_blk_rev_full: u16,
/// maximal eFuse block revision supported by image, in format: major * 100 + minor
max_efuse_blk_rev_full: u16,
reserv2: [u32; 19],
}
Copy link
Member Author

@mattjperez mattjperez Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this for testing in native, however we can maybe use it in place of esp_app_desc_t entirely.
Or if we want to just use the c ffi for esp_app_desc_t we can do an unsafe transmute on the first 256 bytes of an image.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine for now, let's make a ticket to revisit. I'd prefer we ultimately use the C FFI one when possible, and declare our own for native. Ideally with some sanity checks on esp32 that the native one seems to line up with the FFI one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please link the ticket in here once you have filed it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EspAppDesc,
bincode::config::Configuration,
>(
&data[..256], bincode::config::standard()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change this to size_of::<EspAppDesc>

.await
.map_err("failed to write data to partition".to_string())?;
// TODO change back to 'n' after impl async writer
nwritten += data.len();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will probably take the length before writing, but seems to still work

Copy link
Member

@gvaradarajan gvaradarajan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It mostly LGTM actually, just a few questions

let mut loader = EspFirmwareInfoLoader::new();
loader.load(&data).map_err(OtaError::EspError)?;
let new_fw = loader.get_info().map_err(OtaError::EspError)?;
log::info!("current firmware: {:?}", running_fw_info);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log::debug instead (for both statements?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall though, I do feel like this method should probably have some well thought out user facing logs that provide context about what the OTA system is doing.

micro-rdk/src/common/ota.rs Show resolved Hide resolved
let (mut sender, conn) = {
http2::Builder::new(self.exec.clone())
.keep_alive_interval(Some(std::time::Duration::from_secs(120)))
.keep_alive_timeout(std::time::Duration::from_secs(10))
Copy link
Member

@gvaradarajan gvaradarajan Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the minimum amount of time we can use for this timeout? If a smaller timeout works, I think it would be preferred

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a lower limit you have in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I figured we could arrive at this by lowering the duration until it no longer works on a relatively unstable network (probably do something more like binary search)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason that we know we need to adjust these from the defaults?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confirmed working without setting keep_alive_interval and using default timeout (20 seconds).

micro-rdk/src/common/ota.rs Outdated Show resolved Hide resolved

#[cfg(feature = "esp32")]
type OtaConnector = crate::esp32::tcp::Esp32H2Connector;
#[cfg(not(feature = "esp32"))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I forgot from the previous PR, but why are we allowing cases where we're not on an esp32? Isn't OTA meaningless otherwise?

Copy link
Member Author

@mattjperez mattjperez Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

faster iteration when testing, specifically around hyper, config checks, and inspecting the binary to be downloaded (recommended by nico)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair for development purposes maybe, but isn't this confusing for merged-in code? It took me a while to see that OTA compiles for native but makes http calls that essentially do nothing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean by nothing. The calls are for downloading the OTA binary and inspecting it. This would also be useful for verifying future work we talked about for editing the binary signatures directly.

Should I put it back and take it out between every PR for this feature? Or clean it up at the end if it doesn't prove useful beyond the planned work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess to that point, what is the purpose of having a native build?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the native build still tests the connection infrastructure abstraction and acts as a server. In native "OTA", there's no updating happening so it doesn't feel like a fair comparison. But we can leave it in and revisit at the end of the epic

micro-rdk/src/common/conn/viam.rs Outdated Show resolved Hide resolved

// TODO(RSDK-9200): set according to active partition scheme
const OTA_MAX_IMAGE_SIZE: usize = 1024 * 1024 * 4; // 4MB
pub const OTA_MODEL_TYPE: &str = "ota_service";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this just be a slice of OTA_MODEL_TRIPLET so they can't get out of phase?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried doing that initially, but was having issues with using a const str to define another const str, will look into it again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accomplished using once_cell::sync::Lazy, though not a fan of the deref, but it works.

Comment on lines 83 to 101
struct EspAppDesc {
//TODO verify this is doing what I think it's doing, validate results
/// ESP_APP_DESC_MAGIC_WORD (0xABCD5432)
magic_word: u32,
secure_version: u32,
reserv1: [u32; 2],
/// application version
version: [u8; 32],
project_name: [u8; 32],
/// compile time
time: [u8; 16],
/// compile date
date: [u8; 16],
idf_ver: [u8; 32],
app_elf_sha256: [u8; 32],
/// minimal eFuse block revision supported by image, in format: major * 100 + minor
min_efuse_blk_rev_full: u16,
/// maximal eFuse block revision supported by image, in format: major * 100 + minor
max_efuse_blk_rev_full: u16,
reserv2: [u32; 19],
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine for now, let's make a ticket to revisit. I'd prefer we ultimately use the C FFI one when possible, and declare our own for native. Ideally with some sanity checks on esp32 that the native one seems to line up with the FFI one.

reserv2: [u32; 19],
}

// TODO(RSDK-9214)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When TODO(TICKET) please also include a little explanation of what is TODO. It can be brief, but that way the reader can get a sense without needing to jump over to the ticket.

cert: TlsCertificate,
exec: Executor,
) -> Result<Self, OtaError> {
// TODO(RSDK-9205)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto more details on comment

let mut total_downloaded: usize = 0;
let mut got_info = false;

while let Some(next) = response.frame().await {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any ability to exercise control over the max size of the frames we will get?

Copy link
Member Author

@mattjperez mattjperez Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it appears the closest thing would be the range header, to request a specific range of bytes to be served (could be useful for resumable download). this would turn into mutiple requests though, one for each segment of our max size.

However, there's no guarantee that the server supports ranges and that would have to be verified and handled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if total_downloaded < core::mem::size_of::<EspAppDesc>() {
log::error!("initial frame too small to retrieve esp_app_desc_t");
} else {
log::info!("data length {}", data.len());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a debug log, at best, or better removed

let mut loader = EspFirmwareInfoLoader::new();
loader.load(&data).map_err(OtaError::EspError)?;
let new_fw = loader.get_info().map_err(OtaError::EspError)?;
log::info!("current firmware: {:?}", running_fw_info);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall though, I do feel like this method should probably have some well thought out user facing logs that provide context about what the OTA system is doing.

log::info!("new firmware: {:?}", new_fw);
if let Some(ref running_fw) = running_fw_info {
if running_fw.version == new_fw.version
&& running_fw.released == new_fw.released
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is released here and what does it mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the compiled timestamp we talked about before

@@ -622,6 +600,42 @@ where
}
}

#[cfg(feature = "ota")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need certs? I thought the certs were for us serving

Copy link
Member

@gvaradarajan gvaradarajan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -219,7 +219,7 @@ impl Future for AsyncioSleep {
}

#[derive(Default, Clone, Debug)]
struct H2Timer;
pub struct H2Timer;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub(crate) instead?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants