Outlier scores for HDBSCAN #73

azizkayumov · 2024-07-21T13:57:24Z

Closes #71

Summary by CodeRabbit

New Features
- Enhanced the HDbscan clustering algorithm to include outlier scores in the output.
- Introduced a new function for computing outlier scores based on hierarchical clustering results.
Bug Fixes
- Corrected the error message for corrupted data in the main function.

codecov · 2024-07-21T14:01:07Z

Codecov Report

Attention: Patch coverage is 75.67568% with 9 lines in your changes missing coverage. Please review.

Project coverage is 80.56%. Comparing base (9d2c1ca) to head (d551c5c).
Report is 3 commits behind head on main.

Files	Patch %	Lines
src/hdbscan.rs	75.67%	9 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #73      +/-   ##
==========================================
- Coverage   81.12%   80.56%   -0.56%     
==========================================
  Files           4        4              
  Lines         641      674      +33     
==========================================
+ Hits          520      543      +23     
- Misses        121      131      +10


coderabbitai · 2024-10-14T22:44:45Z

Walkthrough

The changes made in this pull request significantly modify the HDbscan clustering algorithm. The fit method is updated to return an additional vector for outlier scores, and its logic is adjusted to compute these scores using a new function called glosh. A helper function, max_lambdas, is also introduced, and a new test case for the glosh function is added to ensure the accuracy of outlier score calculations.
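As a rough illustration of what a per-point outlier score of this kind computes, here is a minimal sketch of the usual GLOSH formula written in terms of lambda (the reciprocal of distance). The function name `glosh_score` and its two inputs are assumptions made for this example; this is not the code added in `src/hdbscan.rs`.

```rust
/// GLOSH-style score for a single point (illustrative sketch only).
///
/// `lambda_point`: lambda at which the point leaves its cluster (assumed input).
/// `lambda_max`: largest lambda reached anywhere inside that cluster (assumed input).
fn glosh_score(lambda_point: f64, lambda_max: f64) -> f64 {
    // In this sketch, a cluster with no recorded density, or a point that
    // never leaves its cluster, is treated as a non-outlier.
    if lambda_max == 0.0 || !lambda_point.is_finite() {
        return 0.0;
    }
    // Close to 0 for points as dense as the cluster core, close to 1 for
    // points that detach long before the core forms.
    (lambda_max - lambda_point) / lambda_max
}
```

With these assumptions, a point that leaves at lambda 0.5 from a cluster whose densest part reaches lambda 2.0 scores 0.75, while a point that persists up to lambda 2.0 scores 0.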

Changes

Files	Change Summary
src/hdbscan.rs	Modified `fit` method to return outlier scores and updated its logic to compute these scores using `glosh`. Added `glosh` and `max_lambdas` functions, along with a test case for `glosh`.
examples/hdbscan.rs	Corrected error message for corrupted data and updated `fit` method return type to include outlier scores.

Assessment against linked issues

Objective	Addressed	Explanation
Implement outlier scores for HDBSCAN (#71)	✅

🐇 In the meadow, where clusters bloom,
Outlier scores now find their room.
With glosh and max, they dance and play,
HDBSCAN shines bright, come what may!
🌼✨


coderabbitai

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (3)

examples/hdbscan.rs (1)
Line range hint 1-91: Enhance example with error handling and explanatory comment.

To improve the example and make it more robust:

Add error handling for the fit method call. This ensures that the example gracefully handles any potential errors in the clustering process, including those that might arise from the new outlier score computation.

Include a brief comment at the beginning of the file explaining the purpose of the example and mentioning the new outlier scores feature. This would help users understand the context and new capabilities demonstrated in the example.

Here's a suggestion for these improvements:
+// This example demonstrates the usage of the HDbscan clustering algorithm,
+// including the newly added outlier score computation feature.

 use std::{env, fs::File, process::exit};

 use csv::ReaderBuilder;
 use ndarray::Array2;
 use petal_clustering::{Fit, HDbscan};
 use petal_neighbors::distance::Euclidean;

 fn main() {
     // ... (existing code)

-    let (clusters, outliers, _outlier_scores) = clustering.fit(&data.view());
+    let (clusters, outliers, outlier_scores) = clustering.fit(&data.view()).expect("Clustering failed");

     // ... (rest of the code)
 }
These changes would make the example more informative and robust.
src/hdbscan.rs (2)

48-49: Update method signature documentation for fit method

The fit method now returns an additional Vec<A> containing outlier scores. Please update the method's documentation and associated comments to reflect this change, ensuring that users are aware of the new return value.

401-442: Standardize notation and clarify documentation in glosh function

The comments within the glosh function mix eps and lambda notations, which might be confusing. Consider standardizing the notation throughout the comments and code to improve readability. Additionally, ensure that mathematical equations are clearly formatted for better understanding.
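For reference, the two notations are related by a reciprocal, so the score can be stated either way. The block below restates the standard GLOSH definition from the HDBSCAN literature; whether the comments in `glosh` use exactly these symbols is not visible in this thread, so treat the symbol names as assumptions.

```latex
% eps(x): radius at which point x detaches from its cluster
% eps_max(x): smallest radius at which that cluster, or any of its sub-clusters, still exists
\lambda(x) = \frac{1}{\varepsilon(x)}, \qquad
\lambda_{\max}(x) = \frac{1}{\varepsilon_{\max}(x)}

\mathrm{GLOSH}(x)
  = 1 - \frac{\varepsilon_{\max}(x)}{\varepsilon(x)}
  = 1 - \frac{\lambda(x)}{\lambda_{\max}(x)}
  = \frac{\lambda_{\max}(x) - \lambda(x)}{\lambda_{\max}(x)}
```

Choosing one of the two forms and using it consistently in both the comments and the code would address the concern raised here.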

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 97c1824 and 560ad58.

📒 Files selected for processing (2)

examples/hdbscan.rs (1 hunks)
src/hdbscan.rs (5 hunks)

🧰 Additional context used

🔇 Additional comments (2)

examples/hdbscan.rs (1)

42-42: Please clarify the impact of setting boruvka to false.

The boruvka field in the HDbscan struct has been changed from true to false. Could you please explain the reasoning behind this change and its impact on the algorithm's behavior or performance? This information would be helpful for users of the example to understand the implications of this setting.

To verify the usage of the boruvka field across the codebase, please run the following script:

src/hdbscan.rs (1)

1004-1066: Great job adding comprehensive tests for glosh function

The new unit test thoroughly validates the glosh implementation against expected outlier scores, enhancing the reliability and correctness of the algorithm.

examples/hdbscan.rs

coderabbitai · 2024-10-14T22:50:03Z

src/hdbscan.rs

+        let condensed = condense_mst(labeled.view(), self.min_cluster_size);
+        let outlier_scores = glosh(&condensed, self.min_cluster_size);
+        let (clusters, outliers) = find_clusters(&Array1::from_vec(condensed).view());
+        (clusters, outliers, outlier_scores)


🛠️ Refactor suggestion

Optimize conversion of condensed data structure

In the line let (clusters, outliers) = find_clusters(&Array1::from_vec(condensed).view());, consider avoiding the conversion from Vec<(usize, usize, A, usize)> to Array1 if not necessary. Operating directly on the Vec could improve performance by reducing unnecessary allocations.

This suggestion requires changing the parameter types of find_clusters, which is outside the scope of this PR.

@azizkayumov, understood. Thank you for clarifying.
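For context on the deferred suggestion, the sketch below shows the general idea of borrowing the condensed edges as a slice rather than converting them into another container first. `Edge`, its field meanings, and `count_leaf_edges` are hypothetical stand-ins; the real `find_clusters` signature is not shown in this thread.

```rust
/// One condensed-tree edge: (parent, child, lambda, child_size).
/// The field meanings are an assumption made for this sketch.
type Edge = (usize, usize, f64, usize);

/// Hypothetical stand-in for a function such as `find_clusters`.
/// It borrows the edges as a slice, so the caller can pass a `Vec`
/// by reference without any copy or conversion.
fn count_leaf_edges(condensed: &[Edge]) -> usize {
    // Count edges whose child is a single point (size 1).
    condensed
        .iter()
        .filter(|&&(_, _, _, size)| size == 1)
        .count()
}
```

A caller holding a `Vec<Edge>` can pass `&edges` directly, because a reference to a `Vec` coerces to a slice.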

src/hdbscan.rs

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)

src/hdbscan.rs (1)
1073-1073: Use approximate equality for floating-point assertions in tests

Floating-point comparisons using assert_eq! may fail due to precision errors. Consider using approximate equality to make the tests more robust.

Replace assertions like:
-    assert_eq!(outlier_scores[i], 0.0);
+    assert!((outlier_scores[i] - 0.0).abs() < 1e-8);
Adjust the epsilon value as appropriate for the required precision.

Also applies to: 1082-1082, 1089-1089
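A small helper keeps such assertions readable when several of them are needed. This is a minimal sketch; the name `approx_eq` and the use of an absolute tolerance are assumptions for illustration, not part of the PR.

```rust
/// Returns true when `a` and `b` differ by at most `eps` (absolute tolerance).
fn approx_eq(a: f64, b: f64, eps: f64) -> bool {
    (a - b).abs() <= eps
}
```

In a test this would read `assert!(approx_eq(outlier_scores[i], 0.0, 1e-8));`. An absolute tolerance suits scores bounded in [0, 1]; values spanning many orders of magnitude would usually call for a relative tolerance instead.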

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 560ad58 and 96f9e94.

📒 Files selected for processing (1)

src/hdbscan.rs (5 hunks)

🧰 Additional context used

🪛 GitHub Check: test (macOS-latest, stable)

src/hdbscan.rs

[failure] 95-95:
the trait bound A: From<u32> is not satisfied

🪛 GitHub Check: test (ubuntu-latest, 1.74.0)

src/hdbscan.rs

[failure] 95-95:
the trait bound A: From<u32> is not satisfied

🪛 GitHub Check: check

src/hdbscan.rs

[failure] 95-95:
the trait bound A: std::convert::From<u32> is not satisfied
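The log shows neither the offending line nor the eventual fix, so the following is only a general sketch of how this class of error tends to arise. In the standard library `f64` implements `From<u32>` but `f32` does not, because the conversion can lose precision, so a generic float parameter `A` cannot rely on `A::from(n)` for a `u32`. The trait `FromCount` and the function `half_of` are hypothetical names used for illustration.

```rust
/// Hypothetical helper trait: convert an integer count into a float type.
trait FromCount {
    fn from_count(n: u32) -> Self;
}

impl FromCount for f32 {
    // `f32: From<u32>` does not exist, so an explicit (possibly lossy) cast is used.
    fn from_count(n: u32) -> Self {
        n as f32
    }
}

impl FromCount for f64 {
    // `f64: From<u32>` exists and is lossless.
    fn from_count(n: u32) -> Self {
        f64::from(n)
    }
}

/// Generic over any float implementing the helper trait. Writing the bound
/// as `A: From<u32>` instead would compile for f64 but be rejected for f32.
fn half_of<A: FromCount + std::ops::Div<Output = A>>(n: u32) -> A {
    A::from_count(n) / A::from_count(2)
}
```

Crates that are already generic over floats often solve the same problem with a conversion trait from their numeric dependency rather than a hand-written one.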

src/hdbscan.rs

sync with main

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 96f9e94 and 0209d70.

📒 Files selected for processing (2)

examples/hdbscan.rs (2 hunks)
src/hdbscan.rs (6 hunks)

🔇 Additional comments (6)

examples/hdbscan.rs (2)

18-18: LGTM! Fixed typo in error message.

The correction from "corruptted" to "corrupted" improves the error message clarity.

44-44: ⚠️ Potential issue

Critical: Outlier scores feature is captured but unused

The main objective of this PR is to implement outlier scores, but the example code currently ignores them by using the _outlier_scores prefix. This doesn't fulfill the PR's purpose of demonstrating the new GLOSH outlier score functionality.

As suggested in the previous review, please modify the example to actually demonstrate the outlier scores:

-    let (clusters, outliers, _outlier_scores) = clustering.fit(&data.view());
+    let (clusters, outliers, outlier_scores) = clustering.fit(&data.view());
     println!("========= Report =========");
     println!("# of events processed: {}", data.nrows());
     println!("# of features provided: {}", data.ncols());
     println!("# of clusters: {}", clusters.len());
     println!(
         "# of events clustered: {}",
         clusters.values().map(|v| v.len()).sum::<usize>(),
     );
     println!("# of outliers: {}", outliers.len());
+    // Display outlier scores statistics
+    if let Some((min, max)) = outlier_scores.iter().fold(None, |acc, &x| {
+        Some(match acc {
+            None => (x, x),
+            Some((min, max)) => (min.min(x), max.max(x)),
+        })
+    }) {
+        println!("Outlier scores range: {:.3} to {:.3}", min, max);
+        println!("Number of points with scores: {}", outlier_scores.len());
+    }

This enhancement would:

- Actually use the outlier scores instead of ignoring them
- Provide useful statistics about the score distribution
- Demonstrate to users how to access and interpret the scores
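Beyond a min/max summary, an example could also rank points by score. The sketch below assumes the scores arrive as a slice of `f64` indexed by point, which matches the example's data type but is still an assumption; `top_outliers` is a hypothetical helper, not part of the crate.

```rust
/// Returns the indices of the `k` highest scores, highest first.
/// Ties keep their original index order because the sort is stable.
fn top_outliers(scores: &[f64], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    // Sort indices by descending score; `total_cmp` gives a total order on f64.
    idx.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    idx.truncate(k);
    idx
}
```

The returned indices can then be used to print the corresponding rows of the input data alongside their scores.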

src/hdbscan.rs (4)

48-60: LGTM! Clean implementation of the outlier scores return value.

The changes to the Fit trait implementation are well-structured and handle edge cases appropriately.

396-419: Well-documented GLOSH implementation!

The documentation clearly explains the algorithm with a helpful visual representation and mathematical formulas.

446-472: LGTM! Efficient implementation of max_lambdas.

The bottom-up traversal approach is well-implemented with clear comments explaining the logic. The vector sizes are correctly computed using the largest parent index.
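To make the idea of a bottom-up pass concrete, here is a minimal sketch, not the PR's `max_lambdas`. It assumes each condensed-tree edge is `(parent, child, lambda, child_size)`, that point ids are smaller than every cluster id, and that a child cluster always has a larger id than its parent. `Edge` and `subtree_max_lambdas` are hypothetical names.

```rust
/// One condensed-tree edge: (parent, child, lambda, child_size). Assumed layout.
type Edge = (usize, usize, f64, usize);

/// For every cluster id, the largest lambda found anywhere in its subtree.
fn subtree_max_lambdas(condensed: &[Edge]) -> Vec<f64> {
    // Size the table by the largest parent id so every parent has a slot.
    let largest_parent = match condensed.iter().map(|e| e.0).max() {
        Some(p) => p,
        None => return Vec::new(),
    };
    let mut deaths = vec![0.0_f64; largest_parent + 1];

    // Pass 1: largest lambda on any edge leaving each parent directly.
    for &(parent, _, lambda, _) in condensed {
        if lambda > deaths[parent] {
            deaths[parent] = lambda;
        }
    }

    // Pass 2: visit edges from the deepest parents upward, so each child's
    // value is final before it is pushed into its parent.
    let mut order: Vec<usize> = (0..condensed.len()).collect();
    order.sort_by(|&a, &b| condensed[b].0.cmp(&condensed[a].0));
    for i in order {
        let (parent, child, _, _) = condensed[i];
        // Children beyond the table are leaf clusters with nothing to propagate;
        // point children hold 0.0 and therefore never change the parent.
        if child <= largest_parent && deaths[child] > deaths[parent] {
            deaths[parent] = deaths[child];
        }
    }
    deaths
}
```

Sorting by descending parent id is what makes the traversal bottom-up under the stated id-ordering assumption; a tree that numbers clusters differently would need an explicit topological order instead.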

1131-1192: Excellent test coverage for outlier scores!

The test case is comprehensive and well-documented:

- Verifies scores for points in clusters (should be 0)
- Validates exact outlier scores with detailed mathematical explanations
- Tests multiple scenarios with different cluster configurations

src/hdbscan.rs

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

src/hdbscan.rs (2)
94-97: Consider optimizing the conversion of condensed data structure.

The conversion from Vec to Array1 in find_clusters call might be unnecessary if the function can be modified to accept a slice instead.
-let (clusters, outliers) = find_clusters(&Array1::from_vec(condensed).view());
+let (clusters, outliers) = find_clusters(&condensed);
396-444: Well-documented implementation of GLOSH algorithm.

The implementation correctly follows the GLOSH paper's methodology. The code is well-documented with clear explanations of the mathematical concepts and edge cases.

A few suggestions for improvement:

Consider adding error handling for empty input

Add debug assertions for input validation
 fn glosh<A: FloatCore>(
     condensed_mst: &[(usize, usize, A, usize)],
     min_cluster_size: usize,
 ) -> Vec<A> {
+    debug_assert!(!condensed_mst.is_empty(), "Empty MST provided");
+    debug_assert!(min_cluster_size > 0, "Invalid min_cluster_size");
+
     let deaths = max_lambdas(condensed_mst, min_cluster_size);

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 0209d70 and 98641ec.

📒 Files selected for processing (1)

src/hdbscan.rs (6 hunks)

🔇 Additional comments (3)

src/hdbscan.rs (3)

48-60: LGTM: Return type updated to include outlier scores.

The trait implementation has been correctly modified to return outlier scores as part of the tuple. The empty case is properly handled by returning an empty vector.

446-472: LGTM: Efficient implementation of max_lambdas function.

The bottom-up traversal approach is efficient and correctly implements the computation of maximum lambda values. The code is well-commented and handles both single points and clusters appropriately.

1131-1192: Excellent test coverage with detailed mathematical validation.

The test case is comprehensive and well-documented:

- Clear dataset structure with distinct clusters and outliers
- Detailed mathematical explanations of expected scores
- Precise validation of both cluster points and outliers
- Edge case coverage with different outlier scenarios

azizkayumov · 2024-11-27T06:58:27Z

@msk Please let me know if these changes look good to you.
One AI suggestion remains; it requires changing the parameters of the find_clusters function. This PR only concerns the glosh and max_lambdas functions, so I am leaving that change for now, pending approval from other contributors.

msk · 2024-11-29T19:38:08Z

@azizkayumov, thanks for implementing outlier scores for HDBSCAN. Your PR looks good to me, and I think it's ready to be merged.

Just a note: before we release 0.11.0, we should add more documentation and an example for HDbscan to help users understand how to use the new outlier scores feature. That can be done in a separate PR, so it doesn't block this one.

Thanks again for your contribution!

azizkayumov added 3 commits July 21, 2024 02:41

glosh: closes petabi#71

5b6c0df

glosh: simplify

130c95f

glosh: fix bottom-up traversal

d551c5c

azizkayumov added 5 commits July 22, 2024 14:53

glosh: added test

f2ceeff

glosh: fix test info

4502c2f

glosh: visually better test case

5abf2ac

glosh: make gloosh private

9464019

Merge branch 'main' into main

560ad58

coderabbitai bot reviewed Oct 14, 2024

Merge branch 'main' into main

96f9e94

coderabbitai bot reviewed Nov 27, 2024

Glosh (#1)

0209d70

sync with main

coderabbitai bot reviewed Nov 27, 2024

fix and trait bounds

98641ec

coderabbitai bot reviewed Nov 27, 2024

msk merged commit 205e7e9 into petabi:main Nov 29, 2024
8 checks passed

coderabbitai bot mentioned this pull request Dec 5, 2024

Example usage for outlier scores #82

Merged
