Outlier scores for HDBSCAN #73

Merged
merged 11 commits into from
Nov 29, 2024

Conversation

azizkayumov (Contributor) commented Jul 21, 2024

Closes #71

Summary by CodeRabbit

  • New Features

    • Enhanced the HDbscan clustering algorithm to include outlier scores in the output.
    • Introduced a new function for computing outlier scores based on hierarchical clustering results.
  • Bug Fixes

    • Corrected the error message for corrupted data in the main function.

codecov bot commented Jul 21, 2024

Codecov Report

Attention: Patch coverage is 75.67568% with 9 lines in your changes missing coverage. Please review.

Project coverage is 80.56%. Comparing base (9d2c1ca) to head (d551c5c).
Report is 3 commits behind head on main.

Files            Patch %   Lines
src/hdbscan.rs   75.67%    9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #73      +/-   ##
==========================================
- Coverage   81.12%   80.56%   -0.56%     
==========================================
  Files           4        4              
  Lines         641      674      +33     
==========================================
+ Hits          520      543      +23     
- Misses        121      131      +10     
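
The percentages in this report can be reproduced from the hit and line counts. The sketch below uses a hypothetical helper, `coverage_pct`, which is not part of the project. The patch size of 37 lines is inferred from "9 missing at 75.67568%" and is not stated in the report.

```rust
/// Coverage as a percentage: covered lines divided by total lines.
/// Hypothetical helper for illustration; returns NaN when `lines` is 0.
fn coverage_pct(hits: u32, lines: u32) -> f64 {
    f64::from(hits) / f64::from(lines) * 100.0
}
```

With the numbers above, `coverage_pct(520, 641)` is about 81.12 (base), `coverage_pct(543, 674)` is about 80.56 (head), and `coverage_pct(28, 37)` is about 75.68 (patch, assuming 37 changed lines of which 9 are uncovered).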

coderabbitai bot commented Oct 14, 2024

Walkthrough

The changes made in this pull request significantly modify the HDbscan clustering algorithm. The fit method is updated to return an additional vector for outlier scores, and its logic is adjusted to compute these scores using a new function called glosh. A helper function, max_lambdas, is also introduced, and a new test case for the glosh function is added to ensure the accuracy of outlier score calculations.
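
As a rough illustration of what a GLOSH-style score computes per point, here is a minimal sketch. The function name `glosh_score`, its two inputs, and the zero-score guard are assumptions made for this example; the PR's actual `glosh` works on the whole condensed tree and may handle edge cases differently.

```rust
/// GLOSH-style score for a single point (illustrative sketch).
/// `lambda_point`: density level at which the point leaves its cluster.
/// `lambda_max`: largest density level at which that cluster, or any
/// cluster descending from it, still holds a point.
fn glosh_score(lambda_point: f64, lambda_max: f64) -> f64 {
    // Assumed guard: degenerate inputs are scored as non-outliers.
    if lambda_max <= 0.0 || !lambda_point.is_finite() {
        return 0.0;
    }
    1.0 - lambda_point / lambda_max
}
```

A point that leaves early (small `lambda_point`) relative to the densest part of its cluster gets a score near 1, while a point that survives as long as the cluster does gets 0.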

Changes

Files | Change Summary
src/hdbscan.rs | Modified fit method to return outlier scores and updated its logic to compute these scores using glosh. Added glosh and max_lambdas functions, along with a test case for glosh.
examples/hdbscan.rs | Corrected error message for corrupted data and updated fit method return type to include outlier scores.

Assessment against linked issues

Objective Addressed Explanation
Implement outlier scores for HDBSCAN (#71)

🐇 In the meadow, where clusters bloom,
Outlier scores now find their room.
With glosh and max, they dance and play,
HDBSCAN shines bright, come what may!
🌼✨


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (3)
examples/hdbscan.rs (1)

Line range hint 1-91: Enhance example with error handling and explanatory comment.

To improve the example and make it more robust:

  1. Add error handling for the fit method call. This ensures that the example gracefully handles any potential errors in the clustering process, including those that might arise from the new outlier score computation.

  2. Include a brief comment at the beginning of the file explaining the purpose of the example and mentioning the new outlier scores feature. This would help users understand the context and new capabilities demonstrated in the example.

Here's a suggestion for these improvements:

+// This example demonstrates the usage of the HDbscan clustering algorithm,
+// including the newly added outlier score computation feature.

 use std::{env, fs::File, process::exit};

 use csv::ReaderBuilder;
 use ndarray::Array2;
 use petal_clustering::{Fit, HDbscan};
 use petal_neighbors::distance::Euclidean;

 fn main() {
     // ... (existing code)

-    let (clusters, outliers, _outlier_scores) = clustering.fit(&data.view());
+    let (clusters, outliers, outlier_scores) = clustering.fit(&data.view()).expect("Clustering failed");

     // ... (rest of the code)
 }

These changes would make the example more informative and robust.

src/hdbscan.rs (2)

48-49: Update method signature documentation for fit method

The fit method now returns an additional Vec<A> containing outlier scores. Please update the method's documentation and associated comments to reflect this change, ensuring that users are aware of the new return value.
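
One way a caller might consume the extra return value is sketched below, assuming the scores arrive as a slice of `f64` indexed by point. `flag_outliers` and the threshold are hypothetical and not part of the crate.

```rust
/// Indices of points whose outlier score is at or above `threshold`.
/// Hypothetical helper: `scores[i]` is assumed to be the score of point `i`.
fn flag_outliers(scores: &[f64], threshold: f64) -> Vec<usize> {
    scores
        .iter()
        .enumerate()
        .filter_map(|(i, &s)| (s >= threshold).then_some(i))
        .collect()
}
```

The right threshold depends on the data; nothing in this PR prescribes one.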


401-442: Standardize notation and clarify documentation in glosh function

The comments within the glosh function mix eps and lambda notations, which might be confusing. Consider standardizing the notation throughout the comments and code to improve readability. Additionally, ensure that mathematical equations are clearly formatted for better understanding.
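
For reference, the two notations are related as follows. This assumes the usual HDBSCAN convention that the density level is the reciprocal of the distance scale, and gives GLOSH as it is commonly stated rather than as quoted from the PR's code comments:

```latex
\lambda = \frac{1}{\varepsilon}, \qquad
\mathrm{GLOSH}(x) = 1 - \frac{\varepsilon_{\max}(x)}{\varepsilon(x)}
                  = 1 - \frac{\lambda(x)}{\lambda_{\max}(x)}
```

Here \varepsilon(x) is the distance scale at which x drops out of its cluster, and \varepsilon_{\max}(x) is the smallest scale at which that cluster or any of its sub-clusters still holds a point. The second equality follows from substituting \varepsilon = 1/\lambda in both terms.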

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 97c1824 and 560ad58.

📒 Files selected for processing (2)
  • examples/hdbscan.rs (1 hunks)
  • src/hdbscan.rs (5 hunks)
🧰 Additional context used
🔇 Additional comments (2)
examples/hdbscan.rs (1)

42-42: Please clarify the impact of setting boruvka to false.

The boruvka field in the HDbscan struct has been changed from true to false. Could you please explain the reasoning behind this change and its impact on the algorithm's behavior or performance? This information would be helpful for users of the example to understand the implications of this setting.

To verify the usage of the boruvka field across the codebase, please run the following script:

src/hdbscan.rs (1)

1004-1066: Great job adding comprehensive tests for glosh function

The new unit test thoroughly validates the glosh implementation against expected outlier scores, enhancing the reliability and correctness of the algorithm.

examples/hdbscan.rs (resolved)
Comment on lines +95 to +98
let condensed = condense_mst(labeled.view(), self.min_cluster_size);
let outlier_scores = glosh(&condensed, self.min_cluster_size);
let (clusters, outliers) = find_clusters(&Array1::from_vec(condensed).view());
(clusters, outliers, outlier_scores)
@coderabbitai coderabbitai bot Oct 14, 2024

🛠️ Refactor suggestion

Optimize conversion of condensed data structure

In the line let (clusters, outliers) = find_clusters(&Array1::from_vec(condensed).view());, consider avoiding the conversion from Vec<(usize, usize, A, usize)> to Array1 if not necessary. Operating directly on the Vec could improve performance by reducing unnecessary allocations.
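
The general idea of borrowing a slice instead of requiring an owned array can be sketched as follows. `Edge`, `count_point_edges`, and the reading of the fourth tuple field as "child size" are assumptions for illustration; this is not the signature of `find_clusters`.

```rust
/// Assumed edge layout: (parent, child, lambda, child size).
type Edge = (usize, usize, f64, usize);

/// Hypothetical consumer that borrows the edges rather than owning them.
/// Counts edges whose child is a single point (child size of 1).
fn count_point_edges(edges: &[Edge]) -> usize {
    edges.iter().filter(|e| e.3 == 1).count()
}
```

Because `&Vec<Edge>` coerces to `&[Edge]`, a caller can pass `&condensed` directly and keep using `condensed` afterwards.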

azizkayumov (Contributor, Author) replied:

This suggestion requires a change in the parameter types of find_clusters, which is outside the scope of this PR.

@azizkayumov, understood. Thank you for clarifying.

src/hdbscan.rs (outdated, resolved)
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)
src/hdbscan.rs (1)

1073-1073: Use approximate equality for floating-point assertions in tests

Floating-point comparisons using assert_eq! may fail due to precision errors. Consider using approximate equality to make the tests more robust.

Replace assertions like:

-    assert_eq!(outlier_scores[i], 0.0);
+    assert!((outlier_scores[i] - 0.0).abs() < 1e-8);

Adjust the epsilon value as appropriate for the required precision.

Also applies to: 1082-1082, 1089-1089
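
A small helper keeps such assertions readable. This is a sketch with an absolute tolerance; `approx_eq` is a hypothetical name, and a relative tolerance may suit scores of widely varying magnitude better.

```rust
/// True when `a` and `b` differ by at most `tol` (absolute tolerance).
fn approx_eq(a: f64, b: f64, tol: f64) -> bool {
    (a - b).abs() <= tol
}
```

In a test this would read `assert!(approx_eq(outlier_scores[i], 0.0, 1e-8));`.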

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 560ad58 and 96f9e94.

📒 Files selected for processing (1)
  • src/hdbscan.rs (5 hunks)
🧰 Additional context used
🪛 GitHub Check: test (macOS-latest, stable)
src/hdbscan.rs

[failure] 95-95:
the trait bound A: From<u32> is not satisfied

🪛 GitHub Check: test (ubuntu-latest, 1.74.0)
src/hdbscan.rs

[failure] 95-95:
the trait bound A: From<u32> is not satisfied

🪛 GitHub Check: check
src/hdbscan.rs

[failure] 95-95:
the trait bound A: std::convert::From<u32> is not satisfied
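
This error means the generic float type `A` is not known to implement `From<u32>`. In the standard library `f64` implements `From<u32>` but `f32` does not, so adding that bound would exclude `f32`. One possible workaround, shown here as a std-only sketch and not necessarily the fix the PR adopted, is to build constants from a narrower integer type that both floats accept:

```rust
use std::ops::Div;

/// Halve a value of any type that can be built losslessly from a `u8`.
/// Both `f32` and `f64` implement `From<u8>`.
fn half<T>(x: T) -> T
where
    T: From<u8> + Div<Output = T>,
{
    x / T::from(2u8)
}
```

With a numeric-traits crate, building the constant from `one()` or a checked cast would be an alternative.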

src/hdbscan.rs (resolved)
src/hdbscan.rs (outdated, resolved)
sync with main
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 96f9e94 and 0209d70.

📒 Files selected for processing (2)
  • examples/hdbscan.rs (2 hunks)
  • src/hdbscan.rs (6 hunks)
🔇 Additional comments (6)
examples/hdbscan.rs (2)

18-18: LGTM! Fixed typo in error message.

The correction from "corruptted" to "corrupted" improves the error message clarity.


44-44: ⚠️ Potential issue

Critical: Outlier scores feature is captured but unused

The main objective of this PR is to implement outlier scores, but the example code currently ignores them by binding them to the underscore-prefixed name _outlier_scores. This doesn't fulfill the PR's purpose of demonstrating the new GLOSH outlier score functionality.

As suggested in the previous review, please modify the example to actually demonstrate the outlier scores:

-    let (clusters, outliers, _outlier_scores) = clustering.fit(&data.view());
+    let (clusters, outliers, outlier_scores) = clustering.fit(&data.view());
     println!("========= Report =========");
     println!("# of events processed: {}", data.nrows());
     println!("# of features provided: {}", data.ncols());
     println!("# of clusters: {}", clusters.len());
     println!(
         "# of events clustered: {}",
         clusters.values().map(|v| v.len()).sum::<usize>(),
     );
     println!("# of outliers: {}", outliers.len());
+    // Display outlier scores statistics
+    if let Some((min, max)) = outlier_scores.iter().fold(None, |acc, &x| {
+        Some(match acc {
+            None => (x, x),
+            Some((min, max)) => (min.min(x), max.max(x)),
+        })
+    }) {
+        println!("Outlier scores range: {:.3} to {:.3}", min, max);
+        println!("Number of points with scores: {}", outlier_scores.len());
+    }

This enhancement would:

  1. Actually use the outlier scores instead of ignoring them
  2. Provide useful statistics about the score distribution
  3. Demonstrate to users how to access and interpret the scores
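
Beyond the min/max range, an example could also list the highest-scoring points. The sketch below assumes the scores are an `f64` slice indexed by point; `top_k_outliers` is a hypothetical helper, not part of the crate.

```rust
/// The `k` highest-scoring points as (index, score), highest first.
fn top_k_outliers(scores: &[f64], k: usize) -> Vec<(usize, f64)> {
    let mut ranked: Vec<(usize, f64)> = scores.iter().copied().enumerate().collect();
    // total_cmp is a total order on f64, so NaN values cannot cause a panic;
    // the sort is stable, so ties keep their original index order.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked.truncate(k);
    ranked
}
```

Note that under `total_cmp` a positive NaN sorts above every finite score, so NaN scores would appear first if any were present.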
src/hdbscan.rs (4)

48-60: LGTM! Clean implementation of the outlier scores return value.

The changes to the Fit trait implementation are well-structured and handle edge cases appropriately.


396-419: Well-documented GLOSH implementation!

The documentation clearly explains the algorithm with a helpful visual representation and mathematical formulas.


446-472: LGTM! Efficient implementation of max_lambdas.

The bottom-up traversal approach is well-implemented with clear comments explaining the logic. The vector sizes are correctly computed using the largest parent index.
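
A bottom-up pass of this kind can be sketched as follows. The edge layout (parent, child, lambda, child size), the top-down edge ordering, and the use of a `HashMap` are assumptions for illustration; the PR's `max_lambdas` uses vectors sized by the largest parent index and may differ in detail.

```rust
use std::collections::HashMap;

/// Assumed edge layout: (parent cluster, child, lambda, child size).
type Edge = (usize, usize, f64, usize);

/// For each cluster id, the largest lambda at which it or any
/// descendant cluster still holds a point.
fn max_lambdas_sketch(edges: &[Edge]) -> HashMap<usize, f64> {
    let mut deaths: HashMap<usize, f64> = HashMap::new();
    // Assumes edges are listed top-down, so iterating in reverse
    // visits every child cluster before its parent.
    for &(parent, child, lambda, child_size) in edges.iter().rev() {
        let candidate = if child_size == 1 {
            // A single point leaving the cluster at this lambda.
            lambda
        } else {
            // A child cluster: propagate its own maximum upward.
            deaths.get(&child).copied().unwrap_or(lambda)
        };
        let slot = deaths.entry(parent).or_insert(candidate);
        *slot = (*slot).max(candidate);
    }
    deaths
}
```

Each point's score can then be derived from its own lambda and the maximum recorded for its parent cluster.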


1131-1192: Excellent test coverage for outlier scores!

The test case is comprehensive and well-documented:

  • Verifies scores for points in clusters (should be 0)
  • Validates exact outlier scores with detailed mathematical explanations
  • Tests multiple scenarios with different cluster configurations

src/hdbscan.rs (outdated, resolved)
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
src/hdbscan.rs (2)

94-97: Consider optimizing the conversion of condensed data structure.

The conversion from Vec to Array1 in find_clusters call might be unnecessary if the function can be modified to accept a slice instead.

-let (clusters, outliers) = find_clusters(&Array1::from_vec(condensed).view());
+let (clusters, outliers) = find_clusters(&condensed);

396-444: Well-documented implementation of GLOSH algorithm.

The implementation correctly follows the GLOSH paper's methodology. The code is well-documented with clear explanations of the mathematical concepts and edge cases.

A few suggestions for improvement:

  1. Consider adding error handling for empty input
  2. Add debug assertions for input validation

 fn glosh<A: FloatCore>(
     condensed_mst: &[(usize, usize, A, usize)],
     min_cluster_size: usize,
 ) -> Vec<A> {
+    debug_assert!(!condensed_mst.is_empty(), "Empty MST provided");
+    debug_assert!(min_cluster_size > 0, "Invalid min_cluster_size");
+
     let deaths = max_lambdas(condensed_mst, min_cluster_size);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 0209d70 and 98641ec.

📒 Files selected for processing (1)
  • src/hdbscan.rs (6 hunks)
🔇 Additional comments (3)
src/hdbscan.rs (3)

48-60: LGTM: Return type updated to include outlier scores.

The trait implementation has been correctly modified to return outlier scores as part of the tuple. The empty case is properly handled by returning an empty vector.


446-472: LGTM: Efficient implementation of max_lambdas function.

The bottom-up traversal approach is efficient and correctly implements the computation of maximum lambda values. The code is well-commented and handles both single points and clusters appropriately.


1131-1192: Excellent test coverage with detailed mathematical validation.

The test case is comprehensive and well-documented:

  • Clear dataset structure with distinct clusters and outliers
  • Detailed mathematical explanations of expected scores
  • Precise validation of both cluster points and outliers
  • Edge case coverage with different outlier scenarios

azizkayumov (Contributor, Author) commented:

@msk Please let me know if these changes look good to you.
There is one AI suggestion remaining, which requires changing the parameters of the find_clusters function. This PR only covers the glosh and max_lambdas functions, so I am leaving it for now, pending approval from other contributors.

msk (Collaborator) commented Nov 29, 2024

@azizkayumov, thanks for implementing outlier scores for HDBSCAN. Your PR looks good to me, and I think it's ready to be merged.

Just a note that before we release 0.11.0, we should add some more documentation and an example for HDbscan to help users understand how to use the new outlier scores feature. But that can be done in a separate PR, so it doesn't block this one.

Thanks again for your contribution!

@msk msk merged commit 205e7e9 into petabi:main Nov 29, 2024
8 checks passed