Hybrid AI Exploration #5
PDF of presentation slides: |
@mmccool As discussed in today's call, here are two examples of audio models that would benefit from improved storage and caching mechanisms, primarily due to their reliance on sub-models and/or adapters:
MMS: mms-1b-all is a 1B-parameter model that uses adapters (~2M parameters each) to enable automatic speech recognition across over 1000 languages.
SeamlessM4T: As stated in the HF model docs:
|
Discussed on WebML WG Teleconference – 7 March 2024. Thanks to the authors for the presentation and the entire group for your feedback that will inform the direction of this exploration. |
I'm interested in the problem of sharing big (or relatively big) models across sites. Past a certain size, this problem calls the viability of client-side AI/ML into question, even if the device is more than capable of running the model. While it's really hard to solve the generic problem of sharing common resources across origins, I'm hopeful that we can find a solution for AI/ML models. In particular, I believe that the following elements would help:
With these elements, one could design something where:
I believe that these elements would avoid some of the problems with trying to tack on cross-origin sharing on top of what currently exists for regular web resources:
I'm curious to hear what folks think about this high-level approach. |
Thank you @xenova and @KenjiBaheux for your insights, much appreciated. The project team has acknowledged your input and my expectation is the team will share updates on their progress in this issue and will check back with you. We may also schedule another group discussion in the near future. On another related topic, to everyone watching, please note a newly published write-up Understanding and managing the impact of Machine Learning models on the Web by @dontcallmedom that welcomes review and feedback. This document in part discusses topics that intersect with this Hybrid AI exploration and may provide complementary perspectives to this exploration, quoting:
Thank you @dontcallmedom for producing this document. |
Hello all, thanks for the great discussions above, I'm Jason Mayes, Web AI Lead at Google - just wanted to weigh in with some thoughts I have been thinking about the past few years given we now have this centralized space for discussion:
a) In the first instance you have models running either on the client-side machine or on the server. Some sort of check occurs to see if the machine is powerful enough to run a given model; if so, the model is downloaded and inference happens entirely locally on the device. Devices that are not powerful enough fall back to a server-side API for inference (hence the hybrid approach).
b) In the second instance I also see a hybrid approach evolving in the name of model security, whereby the model itself is split across the client and the server. Let's take a simple multi-layer perceptron style model. In this hypothetical example, maybe you run the lower layers of the model on the client side, which encodes the raw data into some high-level embedding representation, and the final classification head is kept on the server. This is quite nice for the user, as they gain some level of privacy for the raw data (though I guess a dedicated attacker could somehow reverse engineer it, depending on the model architecture). It also means that if a proprietary model is stolen from the client side, it is not terribly useful to the person who stole it without the classification head. The benefits of this approach are that the company providing the service gets model security while also offloading compute to the client side for significant cost savings (which will get better with time as hardware evolves), and the client is not sending raw data to the server (some level of privacy retained).
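To make the two cases above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `DeviceInfo`, `choose_backend`, the "GPU plus twice the model size in memory" rule, and the toy weights are all hypothetical and do not come from any existing API. Case (a) is the capability check with a server fallback; case (b) splits a tiny perceptron so that only an embedding, not the raw input, would cross the network.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DeviceInfo:
    """Hypothetical capability report; a real page would have to probe these."""
    memory_gb: float
    has_gpu: bool


def choose_backend(device: DeviceInfo, model_size_gb: float) -> str:
    """Case (a): run locally if the device looks capable, else fall back."""
    # Assumed rule of thumb: a GPU and twice the model size in memory.
    if device.has_gpu and device.memory_gb >= 2 * model_size_gb:
        return "client"
    return "server"


def dense(x: Sequence[float], weights, bias) -> List[float]:
    """One fully connected layer: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(xs: Sequence[float]) -> List[float]:
    return [max(0.0, x) for x in xs]


# Toy weights, invented for illustration only.
CLIENT_W, CLIENT_B = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
HEAD_W, HEAD_B = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]


def client_encode(raw: Sequence[float]) -> List[float]:
    """Case (b), client half: lower layers turn raw data into an embedding."""
    return relu(dense(raw, CLIENT_W, CLIENT_B))


def server_classify(embedding: Sequence[float]) -> int:
    """Case (b), server half: the classification head never leaves the server."""
    scores = dense(embedding, HEAD_W, HEAD_B)
    return scores.index(max(scores))
```

In this sketch the client would send only the output of `client_encode` to the server, and a copy of the client half alone cannot produce class labels.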
a) They can call / query the model via a standardized API for common base models that can be relied upon
Cheers. Would love to hear your thoughts. |
@jasonmayes thanks for joining the discussion and sharing your insights! There are many exciting opportunities to explore. I've asked the project team to follow up with a summary of feedback provided so we can refine the next steps together. I'll invite you to our future meetings when this topic is on the agenda next, as well as others interested. The project team obviously agrees with your prediction that 2024 will be the year of Hybrid AI approaches. Your article is acknowledged in the references :-) Also thanks for your contributions over the years, including the Opportunities & Challenges for TensorFlow.js and beyond talk at our 2020 workshop that informed the creation of the WebML WG and influenced the technical direction of the WebNN API. Looking forward to creating more awesome things together in this space. |
Thanks for your input! Here is a summary of the comments above as we understand them. If there are any points we missed please let us know. It would also be helpful to know which of these are higher priority.
|
Discussed on WebML WG Teleconference – 21 March 2024. Summary: Acknowledged the insightful feedback provided in this issue, noted the summary of the feedback is available. The project team to share a proposal for the initial technical approaches to be explored with a prototype for further review and comment. |
We felt it would be helpful if we summarized the technical approach we are exploring (this is just a prototype to test-fly some ideas, and feedback is welcome), and then describe how this would address (some of) the points mentioned. There were several issues raised, and we feel we should prioritize. We will start with the “large model”, “cross-origin”, “adapter”, “models bound to URLs”, and the “built-in foundational model” problems.
Our basic idea is to cache individual nodes in the computational graph (specifically weight/bias tensors, the majority of the storage cost) separately, using keys (specifically, hashes) based on their content. This is similar to the Service Worker caches already tested; however, we are looking at an approach that can be cross-origin and is keyed by each node’s content, not the URL it is loaded from.
The advantage of this approach is that it can be implemented at the API level and so is independent of the serialization. We would be computing hashes over the buffer contents passed to the API, not the serialization string. This would also automatically account for models sharing components or foundational models, as long as e.g. adapters are expressed as part of the model graph (e.g. as constant tensor expressions) and not baked into other tensors. It would also optimize the downloading of sub-models if those share components (e.g. embedders/encoders/decoders).
The way the API works in practice would also support “built-in” models. Basically, the cache API would be extended to allow “loading” a particular node given its hash. If it exists in the cache, OR is a built-in model, the API call would succeed. If it is not available locally, then the call would fail. In this case, the application would catch the error and have to download that particular node.
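The load-by-hash flow described above can be sketched roughly as follows. This is a language-neutral Python illustration, not the proposed browser API: `ContentAddressedCache`, `TensorNotCached`, and `load_or_download` are hypothetical names, SHA-256 is an assumed choice of hash, and the extra check that downloaded bytes match the requested hash is an addition made for the sketch.

```python
import hashlib
from typing import Callable, Dict, Optional


class TensorNotCached(KeyError):
    """Raised when a hash is neither cached nor built in."""


class ContentAddressedCache:
    """Stores tensor buffers under the hash of their content, not their URL."""

    def __init__(self, built_in: Optional[Dict[str, bytes]] = None) -> None:
        self._store: Dict[str, bytes] = {}
        # Built-in tensors behave as if they were already in the cache.
        self._built_in: Dict[str, bytes] = dict(built_in or {})

    @staticmethod
    def key_for(buffer: bytes) -> str:
        # The key is computed over the buffer contents passed to the API.
        return hashlib.sha256(buffer).hexdigest()

    def put(self, buffer: bytes) -> str:
        key = self.key_for(buffer)
        self._store[key] = bytes(buffer)
        return key

    def load(self, key: str) -> bytes:
        if key in self._store:
            return self._store[key]
        if key in self._built_in:
            return self._built_in[key]
        raise TensorNotCached(key)


def load_or_download(cache: ContentAddressedCache, key: str,
                     download: Callable[[str], bytes]) -> bytes:
    """Try the cache first; on failure, download, verify, and populate it."""
    try:
        return cache.load(key)
    except TensorNotCached:
        data = download(key)
        if cache.key_for(data) != key:
            raise ValueError("downloaded tensor does not match requested hash")
        cache.put(data)
        return data
```

Because the key depends only on content, a second application asking for the same hash gets a cache hit regardless of which server the first one downloaded from.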
We are also considering extensions that would allow entire graphs or collections of nodes to be hashed and cached as a group (again, with “built-in” models behaving as if they were “already” in the cache, although caching full graphs in these cases would provide better protection).
This does, however, have its own problems. First, it leads to a need for “modular” file formats and representations of models so that nodes can be downloaded separately. It is, however, not too difficult to automatically expand current file formats into parts on the server. It also means that, for adapters to work with the cache, developers should not bake them in but express them as constant computations. On the other hand, this gives benefits like sharing parts among models and downloading parts in parallel. Finally, the client code needs to know the hashes of the nodes it wants. However, this is the same as needing to know the URLs of the model, and hashes avoid being tied to particular servers. In practice, hashes can be baked into the code or stored in metadata files.
We feel that some of the other pain points mentioned, e.g. version management and “category-based” selection of models, can be built on top of this capability. For example, a semantic versioning system could have a database of hashes and provide an interface to select a model from a version wildcard, e.g. “1.3.*”.
Do people feel this direction is worth exploring? Does anyone see any specific problems with the above approach? |
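As a rough illustration of version selection layered on top of hashes, the sketch below resolves a wildcard such as “1.3.*” against a small registry that maps version strings to model hashes. The registry contents, the function names, and the "pick the newest match" rule are all assumptions made for this example; the hash values are placeholders, not real digests.

```python
from typing import Mapping


def version_matches(version: str, pattern: str) -> bool:
    """Compare dot-separated components; '*' in the pattern matches any value."""
    v_parts, p_parts = version.split("."), pattern.split(".")
    return len(v_parts) == len(p_parts) and all(
        p == "*" or p == v for v, p in zip(v_parts, p_parts)
    )


def select_model_hash(registry: Mapping[str, str], pattern: str) -> str:
    """Return the hash of the newest registered version matching the pattern."""
    candidates = [v for v in registry if version_matches(v, pattern)]
    if not candidates:
        raise LookupError(f"no version matches {pattern!r}")
    # Compare numerically so that 1.3.10 is newer than 1.3.2.
    newest = max(candidates, key=lambda v: tuple(int(p) for p in v.split(".")))
    return registry[newest]


# Hypothetical registry: version -> content hash of the whole model graph.
REGISTRY = {
    "1.2.9": "hash-of-1.2.9",
    "1.3.2": "hash-of-1.3.2",
    "1.3.10": "hash-of-1.3.10",
    "2.0.0": "hash-of-2.0.0",
}
```

The hash returned by `select_model_hash(REGISTRY, "1.3.*")` could then be handed to a load-by-hash cache call, so the version layer never needs to know where the model is hosted.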
@mmccool That is an interesting approach. I had not considered hashing subgraphs of the model for caching and then downloading only the subgraphs that are missing. If that can actually work in a way that is compatible with common converted model formats, that could be interesting. I may be biased here by the people I have interacted with, so please do expand as needed, but right now I see that as:
In the first 2 cases this is likely more well defined, with point 3 being quite wide in how that may come about. Again, please do extrapolate from here, as I can only comment on the things I have seen myself; I am sure there may be others that emerge or exist that I am unaware of. On that note, however, it may be easier to offer an official "conversion" binary that can take saved models from these common formats and "compile" them to a web-safe format to be used with this new Web AI standard that would work with such a proposed implementation. That way, if something new comes along in the future, it could be supported if it reaches critical mass for usage in the web community or proves to be useful. The downside, of course, is that as new things come along, one would need to add a conversion path if it was something substantially new or different for which no other converters exist. |
Agreed, we need to figure out how this will work with existing model representations and file formats. We are looking into the details of the systems you mentioned as well as how Hugging Face represents models. We want to avoid defining yet another model representation. We are still finding our way around the various model representations and would appreciate any input or guidance you and others with more experience can provide.
That said, we feel there are a couple of possible approaches here. It seems most existing file formats already allow for separate storage of tensors and metadata/topology; in fact, this seems necessary for large models due to buffer size limitations. Most representations also seem flexible enough to accommodate additional metadata in their "header" (or whatever part of the representation gets loaded first, before the weights are).
So one option would be to add hash metadata to the headers of existing model representations. This can be done over time: if the hash metadata does not exist in a particular model representation, it will still work, but the browser may download the model redundantly. This will still populate the cache if the model is not already in it, which will benefit any later use, even for another site using the same model. However, developers should be motivated to add the necessary metadata to files, since it will improve the experience of their users by avoiding wait times for downloads. It seems that some representations already include hashes for validation purposes, so this is not that large a change. Of course, our hashes could also support validation of downloads. With this approach a "converter" could just "upgrade" a model by adding hash metadata.
If updating existing representations is not possible, or if a model with an "older" format is to be loaded, then hash metadata can be computed and stored separately, perhaps as part of a manifest file.
Hugging Face already has JSON manifests for nodes, for example, and the hashes could be stored there. In this case the "converter" would just generate a relatively simple manifest file containing the hashes associated with each node and/or the entire graph. In practice, if only the hash for the entire graph is needed, it can be embedded in the JS code (just like the name of the model or URL would be).
It would be good to know if you had any particular models in mind to look at for test cases. For example, we have been looking at mistral-7b and models derived from it with adapters, and how this model is represented in different formats. All comments and feedback welcome. |
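A converter that only generates such a manifest could look roughly like the Python sketch below. The manifest layout (`hash_algorithm`, `nodes`, `graph`), the use of SHA-256, and the way the whole-graph hash is derived from the sorted node names and digests are assumptions made for illustration; they are not taken from Hugging Face or any existing format.

```python
import hashlib
import json
from typing import Dict, Iterable, List, Mapping


def build_manifest(tensors: Mapping[str, bytes]) -> Dict[str, object]:
    """Hash every tensor buffer and derive one hash for the whole graph."""
    nodes = {name: hashlib.sha256(buf).hexdigest()
             for name, buf in sorted(tensors.items())}
    # Assumed graph hash: digest over the sorted (name, node hash) pairs.
    graph = hashlib.sha256()
    for name, digest in nodes.items():
        graph.update(name.encode("utf-8"))
        graph.update(bytes.fromhex(digest))
    return {"hash_algorithm": "sha256", "nodes": nodes,
            "graph": graph.hexdigest()}


def missing_nodes(manifest: Mapping[str, object],
                  cached_hashes: Iterable[str]) -> List[str]:
    """List the node names whose content is not yet in the local cache."""
    cached = set(cached_hashes)
    return [name for name, digest in manifest["nodes"].items()
            if digest not in cached]


# Toy buffers standing in for real weight tensors.
base_model = {"embed.weight": b"EMBED", "head.weight": b"HEAD-v1"}
tuned_model = {"embed.weight": b"EMBED", "head.weight": b"HEAD-v2"}

base_manifest = build_manifest(base_model)
tuned_manifest = build_manifest(tuned_model)
manifest_json = json.dumps(tuned_manifest, indent=2)  # what a server could host
```

With the base model already cached, `missing_nodes(tuned_manifest, base_manifest["nodes"].values())` reports only `head.weight`, which is the redundant-download saving the manifest is meant to enable.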
Per our discussion, a dedicated repo has been created under the Web Machine Learning Community Group to continue this discussion in a structured manner (i.e. discussions split into topic-specific issues etc.): 🆕 https://github.com/webmachinelearning/hybrid-ai Thanks everyone for your feedback and comments! Please watch the new repo. I added a basic readme with ground rules. Simply put, the new repo is for discussion on Hybrid AI topics; any specification incubation work that may follow would need a recharter. @grgustaf and @mmccool please migrate applicable content from this proposal issue to the dedicated repo and loop interested folks in. You can close this issue when the migration is completed. Thank you! |
We are moving this content to the above repo and reorganizing it. Please go to https://github.com/webmachinelearning/hybrid-ai for further comments. We will leave this issue open for now. |
Hybrid AI Exploration
Authors
Introduction
ML on the client supports many use cases better than server-based approaches, and with lower cost for the application provider. However, clients can vary significantly in capabilities. A hybrid approach that can flexibly shift work between server and client can support elasticity and avoid the problem of developers targeting only the weakest clients’ capabilities.
The overall goal of hybrid AI is to maximize the user experience in machine learning applications by providing the web developer the tools to manage the distribution of data and compute resources between servers and the client.
For example, ML models are large. This creates network cost, transfer time, and storage problems. As mentioned, client capabilities can vary. This creates adaptation, partitioning, and versioning problems. We would like to discuss potential solutions to these problems, such as shared caches, progressive model updates, and capability/requirements negotiation.
Requirements and Goals
For the end user, most of the existing WebNN use cases share common user requirements:
Even though it is not a primary requirement, developer ease of use is a factor for adoption. An approach that easily allows a developer to shift load between the server and the client using simple, consistent abstractions will allow for more Hybrid AI applications to be developed faster than one with completely different programming models.
Open Issues
Current implementations of hybrid AI applications (see User Research and References) have the following problems when targeting many of the WebNN use cases:
Non-goals
User Research and References
While these emphasize edge computing (offload from the client), several can also be interpreted as use cases that simply need additional performance on the client.