-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Outlier scores for HDBSCAN #73
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #73 +/- ##
==========================================
- Coverage 81.12% 80.56% -0.56%
==========================================
Files 4 4
Lines 641 674 +33
==========================================
+ Hits 520 543 +23
- Misses 121 131 +10 ☔ View full report in Codecov by Sentry. |
WalkthroughThe changes made in this pull request significantly modify the Changes
Assessment against linked issues
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (3)
examples/hdbscan.rs (1)
Line range hint
1-91
: Enhance example with error handling and explanatory comment.To improve the example and make it more robust:
Add error handling for the
fit
method call. This ensures that the example gracefully handles any potential errors in the clustering process, including those that might arise from the new outlier score computation.Include a brief comment at the beginning of the file explaining the purpose of the example and mentioning the new outlier scores feature. This would help users understand the context and new capabilities demonstrated in the example.
Here's a suggestion for these improvements:
+// This example demonstrates the usage of the HDbscan clustering algorithm, +// including the newly added outlier score computation feature. use std::{env, fs::File, process::exit}; use csv::ReaderBuilder; use ndarray::Array2; use petal_clustering::{Fit, HDbscan}; use petal_neighbors::distance::Euclidean; fn main() { // ... (existing code) - let (clusters, outliers, _outlier_scores) = clustering.fit(&data.view()); + let (clusters, outliers, outlier_scores) = clustering.fit(&data.view()).expect("Clustering failed"); // ... (rest of the code) }These changes would make the example more informative and robust.
src/hdbscan.rs (2)
48-49
: Update method signature documentation forfit
methodThe
fit
method now returns an additionalVec<A>
containing outlier scores. Please update the method's documentation and associated comments to reflect this change, ensuring that users are aware of the new return value.
401-442
: Standardize notation and clarify documentation inglosh
functionThe comments within the
glosh
function mixeps
andlambda
notations, which might be confusing. Consider standardizing the notation throughout the comments and code to improve readability. Additionally, ensure that mathematical equations are clearly formatted for better understanding.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- examples/hdbscan.rs (1 hunks)
- src/hdbscan.rs (5 hunks)
🧰 Additional context used
🔇 Additional comments (2)
examples/hdbscan.rs (1)
42-42
: Please clarify the impact of settingboruvka
tofalse
.The
boruvka
field in theHDbscan
struct has been changed fromtrue
tofalse
. Could you please explain the reasoning behind this change and its impact on the algorithm's behavior or performance? This information would be helpful for users of the example to understand the implications of this setting.To verify the usage of the
boruvka
field across the codebase, please run the following script:src/hdbscan.rs (1)
1004-1066
: Great job adding comprehensive tests forglosh
functionThe new unit test thoroughly validates the
glosh
implementation against expected outlier scores, enhancing the reliability and correctness of the algorithm.
let condensed = condense_mst(labeled.view(), self.min_cluster_size); | ||
let outlier_scores = glosh(&condensed, self.min_cluster_size); | ||
let (clusters, outliers) = find_clusters(&Array1::from_vec(condensed).view()); | ||
(clusters, outliers, outlier_scores) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Optimize conversion of condensed
data structure
In the line let (clusters, outliers) = find_clusters(&Array1::from_vec(condensed).view());
, consider avoiding the conversion from Vec<(usize, usize, A, usize)>
to Array1
if not necessary. Operating directly on the Vec
could improve performance by reducing unnecessary allocations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This suggestion requires a change in the parameter types of find_clusters
, which is not the scope of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@azizkayumov, understood. Thank you for clarifying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
src/hdbscan.rs (1)
1073-1073
: Use approximate equality for floating-point assertions in testsFloating-point comparisons using
assert_eq!
may fail due to precision errors. Consider using approximate equality to make the tests more robust.Replace assertions like:
- assert_eq!(outlier_scores[i], 0.0); + assert!((outlier_scores[i] - 0.0).abs() < 1e-8);Adjust the epsilon value as appropriate for the required precision.
Also applies to: 1082-1082, 1089-1089
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/hdbscan.rs
(5 hunks)
🧰 Additional context used
🪛 GitHub Check: test (macOS-latest, stable)
src/hdbscan.rs
[failure] 95-95:
the trait bound A: From<u32>
is not satisfied
🪛 GitHub Check: test (ubuntu-latest, 1.74.0)
src/hdbscan.rs
[failure] 95-95:
the trait bound A: From<u32>
is not satisfied
🪛 GitHub Check: check
src/hdbscan.rs
[failure] 95-95:
the trait bound A: std::convert::From<u32>
is not satisfied
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
examples/hdbscan.rs
(2 hunks)src/hdbscan.rs
(6 hunks)
🔇 Additional comments (6)
examples/hdbscan.rs (2)
18-18
: LGTM! Fixed typo in error message.
The correction from "corruptted" to "corrupted" improves the error message clarity.
44-44
:
Critical: Outlier scores feature is captured but unused
The main objective of this PR is to implement outlier scores, but the example code currently ignores them by using the _outlier_scores
prefix. This doesn't fulfill the PR's purpose of demonstrating the new GLOSH outlier score functionality.
As suggested in the previous review, please modify the example to actually demonstrate the outlier scores:
- let (clusters, outliers, _outlier_scores) = clustering.fit(&data.view());
+ let (clusters, outliers, outlier_scores) = clustering.fit(&data.view());
println!("========= Report =========");
println!("# of events processed: {}", data.nrows());
println!("# of features provided: {}", data.ncols());
println!("# of clusters: {}", clusters.len());
println!(
"# of events clustered: {}",
clusters.values().map(|v| v.len()).sum::<usize>(),
);
println!("# of outliers: {}", outliers.len());
+ // Display outlier scores statistics
+ if let Some((min, max)) = outlier_scores.iter().fold(None, |acc, &x| {
+ Some(match acc {
+ None => (x, x),
+ Some((min, max)) => (min.min(x), max.max(x)),
+ })
+ }) {
+ println!("Outlier scores range: {:.3} to {:.3}", min, max);
+ println!("Number of points with scores: {}", outlier_scores.len());
+ }
This enhancement would:
- Actually use the outlier scores instead of ignoring them
- Provide useful statistics about the score distribution
- Demonstrate to users how to access and interpret the scores
src/hdbscan.rs (4)
48-60
: LGTM! Clean implementation of the outlier scores return value.
The changes to the Fit
trait implementation are well-structured and handle edge cases appropriately.
396-419
: Well-documented GLOSH implementation!
The documentation clearly explains the algorithm with a helpful visual representation and mathematical formulas.
446-472
: LGTM! Efficient implementation of max_lambdas.
The bottom-up traversal approach is well-implemented with clear comments explaining the logic. The vector sizes are correctly computed using the largest parent index.
1131-1192
: Excellent test coverage for outlier scores!
The test case is comprehensive and well-documented:
- Verifies scores for points in clusters (should be 0)
- Validates exact outlier scores with detailed mathematical explanations
- Tests multiple scenarios with different cluster configurations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
src/hdbscan.rs (2)
94-97
: Consider optimizing the conversion of condensed data structure.The conversion from
Vec
toArray1
infind_clusters
call might be unnecessary if the function can be modified to accept a slice instead.-let (clusters, outliers) = find_clusters(&Array1::from_vec(condensed).view()); +let (clusters, outliers) = find_clusters(&condensed);
396-444
: Well-documented implementation of GLOSH algorithm.The implementation correctly follows the GLOSH paper's methodology. The code is well-documented with clear explanations of the mathematical concepts and edge cases.
A few suggestions for improvement:
- Consider adding error handling for empty input
- Add debug assertions for input validation
fn glosh<A: FloatCore>( condensed_mst: &[(usize, usize, A, usize)], min_cluster_size: usize, ) -> Vec<A> { + debug_assert!(!condensed_mst.is_empty(), "Empty MST provided"); + debug_assert!(min_cluster_size > 0, "Invalid min_cluster_size"); + let deaths = max_lambdas(condensed_mst, min_cluster_size);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/hdbscan.rs
(6 hunks)
🔇 Additional comments (3)
src/hdbscan.rs (3)
48-60
: LGTM: Return type updated to include outlier scores.
The trait implementation has been correctly modified to return outlier scores as part of the tuple. The empty case is properly handled by returning an empty vector.
446-472
: LGTM: Efficient implementation of max_lambdas function.
The bottom-up traversal approach is efficient and correctly implements the computation of maximum lambda values. The code is well-commented and handles both single points and clusters appropriately.
1131-1192
: Excellent test coverage with detailed mathematical validation.
The test case is comprehensive and well-documented:
- Clear dataset structure with distinct clusters and outliers
- Detailed mathematical explanations of expected scores
- Precise validation of both cluster points and outliers
- Edge case coverage with different outlier scenarios
@msk Please let me know if these changes look good to you. |
@azizkayumov, thanks for implementing outlier scores for HDBSCAN. Your PR looks good to me, and I think it's ready to be merged. Just a note that before we release 0.11.0, we should add some more documentation and an example for Thanks again for your contribution! |
Closes #71
Summary by CodeRabbit
New Features
HDbscan
clustering algorithm to include outlier scores in the output.Bug Fixes