Final tweaks to member enrichment #2704

epipav · 2024-12-02T08:40:32Z

Changes proposed ✍️

What

copilot:summary

copilot:poem

Why

How

copilot:walkthrough

Checklist ✅

Label appropriately with Feature, Improvement, or Bug.
Add screenshots to the PR description for relevant FE changes
New backend functionality has been unit-tested.
API documentation has been updated (if necessary) (see docs on API documentation).
Quality standards are met.

Summary by CodeRabbit

Release Notes

New Features
- Enhanced member enrichment processing with unlimited member handling.
- Introduced a new LinkedIn profile extraction method for improved data normalization.
- Added support for a new LLM model variant, CLAUDE_3_5_SONNET_V2.
- Added a credit check mechanism before data enrichment to optimize processing.
- Introduced a function for checking intersections between arrays for improved data handling.
Improvements
- Streamlined data sanitization processes across multiple services.
- Refined logic for handling member contributions and identities.
- Improved decision-making process for profile selection by checking unverified identities.
- Updated logging levels to reduce verbosity during operations.
Bug Fixes
- Adjusted logging levels to improve clarity and reduce verbosity during operations.
Chores
- Updated SQL queries to filter out bot members, ensuring only valid entries are processed.

…rified to prevent multiple org matches

coderabbitai · 2024-12-02T08:40:42Z

Walkthrough

The pull request introduces significant modifications across multiple files in the member enrichment worker service. Key changes include the addition of credit checks in data fetching, enhancements to logging levels, and improved processing logic for member data. New functions for handling string sanitization are added, and the handling of member organizations is refined. Additionally, a new model variant is introduced in the LLM settings. Overall, the updates aim to streamline member enrichment processes and improve data integrity.

Changes

File	Change Summary
`services/apps/members_enrichment_worker/src/activities/enrichment.ts`	- Updated `getEnrichmentData`, `updateMemberUsingSquashedPayload`, and added `syncMember`, `syncOrganization` functions for simplified logic and debug logging. - Refined processing of identities and contributions.
`services/apps/members_enrichment_worker/src/bin/onboarding.ts`	- Changed `maxMembersToProcess` from 1000 to `Infinity` to allow unlimited member processing. - Updated service configuration settings.
`services/apps/members_enrichment_worker/src/sources/clearbit/service.ts`	- Added `replaceDoubleQuotes` function for sanitizing employment data in `normalizeEmployment`.
`services/apps/members_enrichment_worker/src/sources/crustdata/service.ts`	- Added `replaceDoubleQuotes` for cleaning employment fields and refined email handling in `normalizeIdentities`.
`services/apps/members_enrichment_worker/src/sources/progai/service.ts`	- Updated `normalizeEmployment` to refine processing logic and added `replaceDoubleQuotes` for company fields.
`services/apps/members_enrichment_worker/src/sources/serp/service.ts`	- Introduced `getLinkedInProfileHandle` method for improved URL handling.
`services/apps/members_enrichment_worker/src/workflows/enrichMember.ts`	- Added credit check using `hasRemainingCredits` before enrichment data fetching.
`services/apps/members_enrichment_worker/src/workflows/processMemberSources.ts`	- Enhanced identity verification and data sanitization in member enrichment processing.
`services/libs/common/src/utils.ts`	- Updated `replaceDoubleQuotes` to handle a broader range of double quote characters.
`services/libs/common_services/src/services/llm.service.ts`	- Changed logging level for estimated LLM cost from info to debug.
`services/libs/data-access-layer/src/old/apps/members_enrichment_worker/index.ts`	- Reformatted imports and added SQL condition to filter out bot members in `fetchMembersForEnrichment`.
`services/libs/data-access-layer/src/organizations/organizations.ts`	- Changed logging level from `log.warn` to `log.debug` in `findOrCreateOrganization` function.
`services/libs/types/src/enums/llm.ts`	- Added new enum value `CLAUDE_3_5_SONNET_V2`.
`services/libs/types/src/llm.ts`	- Updated model settings to include `CLAUDE_3_5_SONNET_V2` and adjusted existing entries accordingly.
`services/apps/members_enrichment_worker/src/activities.ts`	- Added `hasRemainingCredits` function for credit management.
`services/libs/common/src/array.ts`	- Introduced `hasIntersection` function to check for common elements in two arrays.
`services/libs/types/src/enrichment.ts`	- Updated `IMemberOrganizationData` interface to include an optional `identities` property.

Possibly related PRs

Member enrichment null byte sanitization, retry policy updates #2661: This PR enhances the member enrichment process by introducing a hasRemainingCredits function, which is relevant to the main PR's focus on credit management in member enrichment operations.
Improvement/stricter serp api enrichment filter #2670: This PR modifies the getEnrichmentData function to include a check for remaining credits, directly relating to the changes made in the main PR regarding credit checks before data retrieval.
Member enrichment ProgAI Linkedin scraper source #2674: This PR introduces a new function for retrieving member enrichment cache data, which aligns with the main PR's updates to member enrichment functionalities.
Crustdata linkedin enrichment #2678: This PR adds a new enrichment service for Crustdata, which is relevant as it expands the sources for member enrichment, similar to the enhancements in the main PR.
Activities sync fixes #2685: This PR focuses on improving the synchronization process for activities, which is related to the overall member enrichment workflow discussed in the main PR.
Fix profiles worker #2708: This PR addresses issues in the profiles worker, which may indirectly relate to the member enrichment processes outlined in the main PR.

Suggested labels

Improvement, Feature

🐰 "In the fields where data grows,
We enrich the members, as everyone knows.
With quotes replaced and credits checked,
Our logic is clear, our paths perfect.
So hop along, let’s process with glee,
For a brighter tomorrow, just wait and see!" 🐇

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

….com:CrowdDotDev/crowd.dev into bugfix/check-org-existence-in-enrichment-fix

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (7)

services/apps/members_enrichment_worker/src/sources/progai/service.ts (2)

277-280: Use URL Parsing for More Reliable GitHub Link Detection

Instead of using includes('github') to check if companyUrl is a GitHub link, consider parsing the URL to extract the hostname and compare it directly to 'github.com'. This will handle edge cases and make the check more robust.

294-295: Use URL Parsing to Extract LinkedIn Company Identifier

Using split('/').pop() may not reliably extract the company identifier from companyLinkedInUrl, especially if the URL has trailing slashes or query parameters. Consider using a URL parser to extract the path segments accurately.
services/apps/members_enrichment_worker/src/activities/enrichment.ts (1)
610-610: Typographical Error in Variable Name

The variable profilesFromUnverfiedIdentities has a typo. It should be profilesFromUnverifiedIdentities.

Apply this diff to correct the typo:
- const profilesFromUnverfiedIdentities: IMemberEnrichmentDataNormalized[] = []
+ const profilesFromUnverifiedIdentities: IMemberEnrichmentDataNormalized[] = []
Also, update all occurrences of this variable throughout the code.
services/apps/members_enrichment_worker/src/workflows/enrichMember.ts (1)
50-54: Consider adding logging for credit exhaustion

While the credit check logic is sound, it would be helpful to log when enrichment is skipped due to insufficient credits. This would aid in monitoring and debugging.
 if (!(await hasRemainingCredits(source))) {
+  this.log.info(`Skipping enrichment for source ${source} due to insufficient credits`, {
+    memberId: input.id,
+    source,
+  });
   await touchMemberEnrichmentCacheUpdatedAt(source, input.id)
   continue
 }
services/apps/members_enrichment_worker/src/sources/serp/service.ts (1)
138-142: Consider enhancing LinkedIn handle extraction

While the regex pattern is correct for standard LinkedIn profile URLs, consider handling additional edge cases:

URLs with tracking parameters after the handle

Language-specific LinkedIn domains

Handles containing special characters
 private getLinkedInProfileHandle(url: string): string | null {
-  const regex = /in\/([^/]+)/
+  const regex = /in\/([^/?#]+)/
   const match = url.match(regex)
-  return match ? match[1] : null
+  return match ? decodeURIComponent(match[1]) : null
 }
services/apps/members_enrichment_worker/src/sources/crustdata/service.ts (1)
355-363: Consider extracting the sanitization logic into a helper function.

While the implementation is correct, the repeated use of replaceDoubleQuotes could be simplified by extracting it into a helper function.

Consider this refactor:
+ private sanitizeEmploymentFields(fields: {
+   name?: string;
+   title?: string;
+   description?: string;
+ }) {
+   return {
+     name: fields.name ? replaceDoubleQuotes(fields.name) : null,
+     title: fields.title ? replaceDoubleQuotes(fields.title) : null,
+     description: fields.description ? replaceDoubleQuotes(fields.description) : null,
+   };
+ }

  private normalizeEmployment(
    data: IMemberEnrichmentDataCrustdata,
    normalized: IMemberEnrichmentDataNormalized,
  ): IMemberEnrichmentDataNormalized {
    // ... existing code ...

    if (employmentInformation.length > 0) {
      for (const workExperience of employmentInformation) {
        const identities = []
        // ... existing identity code ...

+       const sanitizedFields = this.sanitizeEmploymentFields({
+         name: workExperience.employer_name,
+         title: workExperience.employee_title,
+         description: workExperience.employer_linkedin_description,
+       });

        normalized.memberOrganizations.push({
-         name: replaceDoubleQuotes(workExperience.employer_name),
+         name: sanitizedFields.name,
          source: OrganizationSource.ENRICHMENT_CRUSTDATA,
          identities,
-         title: replaceDoubleQuotes(workExperience.employee_title),
+         title: sanitizedFields.title,
          startDate: workExperience?.start_date ?? null,
          endDate: workExperience?.end_date ?? null,
-         organizationDescription: replaceDoubleQuotes(
-           workExperience.employer_linkedin_description,
-         ),
+         organizationDescription: sanitizedFields.description,
        })
      }
    }
services/libs/data-access-layer/src/old/apps/members_enrichment_worker/index.ts (1)

130-130: LGTM: Added bot filtering for member enrichment

The addition of the bot filtering condition is a good optimization that prevents unnecessary processing of bot accounts. The implementation safely handles null values by defaulting to false.

Consider adding an index on the (attributes->>'isBot')::boolean expression if you expect a significant number of bot accounts, as this would improve the query performance.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7f47984 and 3b5c48a.

📒 Files selected for processing (14)

services/apps/members_enrichment_worker/src/activities/enrichment.ts (8 hunks)
services/apps/members_enrichment_worker/src/bin/onboarding.ts (1 hunks)
services/apps/members_enrichment_worker/src/sources/clearbit/service.ts (2 hunks)
services/apps/members_enrichment_worker/src/sources/crustdata/service.ts (2 hunks)
services/apps/members_enrichment_worker/src/sources/progai/service.ts (2 hunks)
services/apps/members_enrichment_worker/src/sources/serp/service.ts (1 hunks)
services/apps/members_enrichment_worker/src/workflows/enrichMember.ts (2 hunks)
services/apps/members_enrichment_worker/src/workflows/processMemberSources.ts (3 hunks)
services/libs/common/src/utils.ts (1 hunks)
services/libs/common_services/src/services/llm.service.ts (1 hunks)
services/libs/data-access-layer/src/old/apps/members_enrichment_worker/index.ts (3 hunks)
services/libs/data-access-layer/src/organizations/organizations.ts (1 hunks)
services/libs/types/src/enums/llm.ts (1 hunks)
services/libs/types/src/llm.ts (3 hunks)

✅ Files skipped from review due to trivial changes (1)

services/libs/common_services/src/services/llm.service.ts

🧰 Additional context used

🪛 GitHub Check: lint-format-services

services/libs/data-access-layer/src/old/apps/members_enrichment_worker/index.ts

[failure] 1-1:
Delete ;

[failure] 2-2:
Delete ;

[failure] 3-3:
Replace ·IAttributes,·IEnrichableMember,·IMemberEnrichmentCache,·IMemberEnrichmentSourceQueryInput,·IMemberIdentity,·IMemberOrganizationData,·IMemberOriginalData,·IOrganizationIdentity,·MemberEnrichmentSource,·MemberIdentityType,·OrganizationSource·}·from·'@crowd/types';⏎ with ⏎··IAttributes,⏎··IEnrichableMember,⏎··IMemberEnrichmentCache,⏎··IMemberEnrichmentSourceQueryInput,⏎··IMemberIdentity,⏎··IMemberOrganizationData,⏎··IMemberOriginalData,⏎··IOrganizationIdentity,⏎··MemberEnrichmentSource,⏎··MemberIdentityType,⏎··OrganizationSource,⏎}·from·'@crowd/types'

[failure] 704-704:
Insert ⏎

🔇 Additional comments (12)

services/apps/members_enrichment_worker/src/workflows/processMemberSources.ts (2)

236-236: Data Sanitization Applied to Squashed Attributes

Good job applying cleanAttributeValue to ensure the squashed attributes are sanitized before being added to the payload. This helps maintain data integrity.

268-283: Correct Implementation to Limit Verified Identities

The logic correctly ensures that only one verified identity is retained in each work experience by marking additional verified identities as unverified. This prevents potential database conflicts with duplicate verified identities.

services/apps/members_enrichment_worker/src/activities/enrichment.ts (1)

68-69: Simplified Control Flow in getEnrichmentData Function

Removing the credit check from getEnrichmentData simplifies the function's control flow. Ensure that credit checks are handled appropriately elsewhere to manage resource usage.

services/libs/types/src/enums/llm.ts (1)

3-3: New LLM Model Type Added

The addition of CLAUDE_3_5_SONNET_V2 to the LlmModelType enum is correctly implemented. This extends the available models for LLM queries.

services/libs/common/src/utils.ts (1)

76-76: Improved Double Quote Replacement

The replaceDoubleQuotes function now handles various Unicode representations of double quotes, enhancing data sanitization across different character sets.

services/libs/types/src/llm.ts (2)

46-49: Confirm pricing accuracy for new model

The pricing configuration for CLAUDE_3_5_SONNET_V2 matches CLAUDE_3_5_SONNET. Please verify these values are accurate according to the provider's documentation.

62-62: Verify impact of model change on enrichment quality

Multiple enrichment queries have been updated to use CLAUDE_3_5_SONNET_V2. While the model configurations are consistent, we should verify that the new model maintains or improves the quality of enrichment results.

Also applies to: 70-70, 78-78

services/apps/members_enrichment_worker/src/sources/clearbit/service.ts (2)

3-3: LGTM: Import statement added correctly.

The replaceDoubleQuotes utility function is imported from the common package.

220-223: LGTM: Employment data sanitization implemented correctly.

The replaceDoubleQuotes function is appropriately applied to both the organization name and title fields.

services/apps/members_enrichment_worker/src/bin/onboarding.ts (1)

30-30: Verify resource implications of unlimited member processing.

Changing maxMembersToProcess to Infinity removes the upper bound on member processing. While this allows for complete processing of all eligible members, consider:

Memory usage over extended runs
Database connection pool limits
API rate limits for enrichment services

The maxConcurrentProcessing limit of 5 provides some protection, but monitoring system resources during extended runs is advised.

✅ Verification successful

Let me gather more specific information about memory management and concurrency controls in this worker.

Let me check the member processing implementation to understand the batching and database interaction patterns.

Based on the gathered information, I can now provide a final response about the resource implications:

Setting maxMembersToProcess to Infinity is acceptable with current safeguards

The implementation has several built-in protections against resource exhaustion:

Concurrent processing is limited to 5 members at a time via maxConcurrentProcessing = 5
Members are processed in pages with controlled database queries that filter for specific criteria (>1 enrichment source, >100 activities)
Each member workflow has a 15-minute timeout and retry limits
Database load is naturally regulated through the temporal workflow queue

The Infinity setting simply allows the worker to continue processing eligible members without an arbitrary cap, while the above controls prevent resource overload.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any rate limiting configurations in the codebase
rg -A 3 'rate.*limit|throttle|maxRequest|concurrent'

# Look for memory management configurations
rg -A 3 'memory|heap|cache'

Length of output: 66264

Script:

#!/bin/bash
# Look for worker configuration and concurrency settings
rg -A 3 "maxConcurrentProcessing|batch.*size|chunk.*size|queue.*config" services/apps/members_enrichment_worker/

# Look for memory management in the worker
rg -A 3 "memory|heap" services/apps/members_enrichment_worker/

Length of output: 917

Script:

#!/bin/bash
# Look for database operations and member processing logic
rg -A 10 "getEnrichableMembers|processMember" services/apps/members_enrichment_worker/src/bin/onboarding.ts

# Check for any database connection pool configurations
rg -A 3 "pool.*size|connection.*pool" services/libs/data-access-layer/

Length of output: 2625

services/apps/members_enrichment_worker/src/sources/crustdata/service.ts (1)

3-3: LGTM: Import statement added correctly.

The replaceDoubleQuotes utility function is imported alongside the existing isEmail import.

services/libs/data-access-layer/src/organizations/organizations.ts (1)

516-519: LGTM: Appropriate logging level adjustment

The change from warn to debug level is appropriate here as this is an expected scenario in the normal flow, not a warning condition.

coderabbitai · 2024-12-02T09:10:20Z

services/apps/members_enrichment_worker/src/sources/serp/service.ts

          type: MemberIdentityType.USERNAME,
          verified: false,
-          value: this.normalizeLinkedUrl(data.linkedinUrl).split('/').pop(),
+          value: this.getLinkedInProfileHandle(this.normalizeLinkedUrl(data.linkedinUrl)),


🛠️ Refactor suggestion

Add error handling for handle extraction

The normalize method should handle cases where getLinkedInProfileHandle returns null.

- value: this.getLinkedInProfileHandle(this.normalizeLinkedUrl(data.linkedinUrl)), + value: this.getLinkedInProfileHandle(this.normalizeLinkedUrl(data.linkedinUrl)) ?? + this.log.warn('Failed to extract LinkedIn handle', { url: data.linkedinUrl }) || null,

Committable suggestion skipped: line range outside the PR's diff.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

services/apps/members_enrichment_worker/src/workflows/processMemberSources.ts (1)
268-282: Consider extracting the identity verification logic into a separate function

While the logic for handling verified identities is correct, it could be more maintainable if extracted into a dedicated function. This would improve readability and make the code more testable.

Consider refactoring to:
+ function ensureSingleVerifiedIdentity(identities: any[]): any[] {
+   let found = false;
+   return (identities || []).map((i) => {
+     if (i.verified && !found) {
+       found = true;
+       return i;
+     } else if (i.verified) {
+       return { ...i, verified: false };
+     }
+     return i;
+   });
+ }

  workExperiencesSquashedByLLM.forEach((we) => {
-   let found = false
-   we.identities = (we.identities || []).map((i) => {
-     if (i.verified && !found) {
-       found = true;
-       return i;
-     } else if (i.verified) {
-       return { ...i, verified: false };
-     }
-     return i;
-   });
+   we.identities = ensureSingleVerifiedIdentity(we.identities);
  });
This refactoring:

Improves code organization and reusability

Makes the logic easier to test in isolation

Maintains the same functionality while being more maintainable

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9fac18e and 64aee6c.

📒 Files selected for processing (2)

services/apps/members_enrichment_worker/src/activities/enrichment.ts (8 hunks)
services/apps/members_enrichment_worker/src/workflows/processMemberSources.ts (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

services/apps/members_enrichment_worker/src/activities/enrichment.ts

🔇 Additional comments (1)

services/apps/members_enrichment_worker/src/workflows/processMemberSources.ts (1)

236-236: LGTM: Consistent attribute value cleaning

The addition of cleanAttributeValue for LLM-squashed attributes maintains data consistency with individually processed attributes.

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)

services/apps/members_enrichment_worker/src/sources/progai/service.ts (2)

275-289: Consider making domain verification logic more explicit

The GitHub URL exclusion logic is good, but consider extracting it into a separate method for better maintainability and reusability.

+ private isValidCompanyDomain(url: string, company: string): boolean {
+   const isGithubUrl = url.toLowerCase().includes('github')
+   const isGithubCompany = (company || '').toLowerCase().includes('github')
+   return !isGithubUrl && !isGithubCompany
+ }

  if (workExperience.companyUrl) {
    const normalizedDomain = websiteNormalizer(workExperience.companyUrl, false)
-   if (
-     normalizedDomain &&
-     !workExperience.companyUrl.toLowerCase().includes('github') &&
-     !(workExperience.company || '').toLowerCase().includes('github')
-   ) {
+   if (normalizedDomain && this.isValidCompanyDomain(workExperience.companyUrl, workExperience.company)) {

320-336: Consider enhancing LinkedIn URL handling

While the current implementation handles basic company and school URLs well, consider:

Adding input validation
Supporting more LinkedIn URL formats (e.g., with query parameters)

 private getLinkedInProfileHandle(url: string): string | null {
+  if (!url || typeof url !== 'string') {
+    return null;
+  }
+  
+  // Remove query parameters and trailing slashes
+  const cleanUrl = url.split('?')[0].replace(/\/+$/, '');
+
   let regex = /company\/([^/]+)/
-  let match = url.match(regex)
+  let match = cleanUrl.match(regex)

   if (match) {
     return `company:${match[1]}`
   }

   regex = /school\/([^/]+)/
-  match = url.match(regex)
+  match = cleanUrl.match(regex)

   if (match) {
     return `school:${match[1]}`
   }

   return null
 }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 64aee6c and a859861.

📒 Files selected for processing (1)

services/apps/members_enrichment_worker/src/sources/progai/service.ts (2 hunks)

🔇 Additional comments (2)

services/apps/members_enrichment_worker/src/sources/progai/service.ts (2)

4-4: LGTM: Good addition of string sanitization utility

The addition of replaceDoubleQuotes from @crowd/common is a good practice for sanitizing company names and titles.

264-268: Good improvement in null handling

The enhanced conditions ensure we only process work experiences with valid company information, preventing potential issues with null values.

coderabbitai · 2024-12-03T13:39:04Z

services/apps/members_enrichment_worker/src/sources/progai/service.ts

+            startDate: workExperience.startDate
+              ? workExperience.startDate.replace('Z', '+00:00')
+              : null,
+            endDate: workExperience.endDate ? workExperience.endDate.replace('Z', '+00:00') : null,


💡 Codebase verification

Based on the code analysis, I can now provide the final response:

Add date format validation for work experience dates

The code assumes dates from the ProgAI API are in a format ending with 'Z' (UTC timezone marker), but there's no validation to ensure this. The type definition shows that startDate and endDate are just string types without format constraints. Consider:

Adding input validation to ensure dates match the expected format before manipulation

Using a date parsing library like date-fns or similar to handle the timezone conversion safely

Adding error handling for malformed dates

🔗 Analysis chain

Validate date string manipulation

The date string manipulation assumes a specific format. Consider adding validation or using a date manipulation library to handle different date formats safely.

Let me check the input date format and validation in the codebase.

Let me check if there's any date format validation or parsing before this transformation.

Let me check the entire file content to understand the context better.

Let me check the type definitions for the work experience data structure.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash # Check for potential date format issues in the codebase rg -A 2 "replace$'Z', '\+00:00'$"

Length of output: 608

Script:

#!/bin/bash # Check the workExperience type definition and any date validation ast-grep --pattern 'interface $_WorkExperience { $$$ startDate$_ $$$ endDate$_ $$$ }' # Check for any date validation or parsing logic rg -B 2 -A 2 "workExperience\.(startDate|endDate)"

Length of output: 1053

Script:

#!/bin/bash # Check for date validation or parsing in the file rg -B 5 -A 5 "function.*workExperience" services/apps/members_enrichment_worker/src/sources/progai/service.ts # Check if there's any date validation utility being used rg -l "import.*date|moment|luxon|dayjs" services/apps/members_enrichment_worker/src/sources/progai/service.ts # Look for any date format validation ast-grep --pattern 'function $_(workExperience$_) { $$$ }'

Length of output: 286

Script:

#!/bin/bash # Get the content of the service file cat services/apps/members_enrichment_worker/src/sources/progai/service.ts # Check for any date-related types or interfaces ast-grep --pattern 'type $_ = { $$$ date$_ $$$ }' # Check for any date validation functions rg -l "validateDate|isValidDate|parseDate" services/apps/members_enrichment_worker/src/sources/progai/service.ts

Length of output: 11440

Script:

#!/bin/bash # Check the types file for work experience interface cat services/apps/members_enrichment_worker/src/sources/progai/types.ts # Check if there's any date format validation in the utils rg -l "date" services/apps/members_enrichment_worker/src/utils/ # Look for any date-related imports in the types file rg "import.*date" services/apps/members_enrichment_worker/src/sources/progai/types.ts

Length of output: 2202

…eriences from enrichment

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

services/apps/members_enrichment_worker/src/workflows/processMemberSources.ts (2)
268-282: Consider enhancing identity selection logic

While the logic to prevent multiple verified identities is sound, consider:

Adding logging when identities are marked as unverified

Implementing a more deterministic selection criteria (e.g., based on timestamp or data quality)
 workExperiencesSquashedByLLM.forEach((we) => {
   let found = false
   we.identities = (we.identities || []).map((i) => {
     if (i.verified && !found) {
       found = true
       return i
     } else if (i.verified) {
+      console.log(`Marking identity as unverified for organization to prevent merge issues: ${JSON.stringify(i)}`)
       return { ...i, verified: false }
     }
     return i
   })
 })
Line range hint 1-307: Consider architectural improvements

While the code is functional, consider the following improvements:

Break down the large processMemberSources function into smaller, focused functions for better maintainability:

squashIdentities

squashAttributes

squashWorkExperiences

Add performance monitoring for LLM operations

Enhance error handling for edge cases

This would improve maintainability and make the code easier to test.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a859861 and 5e2fcb0.

📒 Files selected for processing (4)

services/apps/members_enrichment_worker/src/activities.ts (2 hunks)
services/apps/members_enrichment_worker/src/activities/enrichment.ts (13 hunks)
services/apps/members_enrichment_worker/src/workflows/enrichMember.ts (2 hunks)
services/apps/members_enrichment_worker/src/workflows/processMemberSources.ts (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

services/apps/members_enrichment_worker/src/workflows/enrichMember.ts
services/apps/members_enrichment_worker/src/activities/enrichment.ts

🔇 Additional comments (5)

services/apps/members_enrichment_worker/src/activities.ts (2)

11-11: LGTM! Import added in correct alphabetical order.

The new import follows the existing pattern and maintains alphabetical ordering.

79-79: Consider maintaining alphabetical order in exports and verify credit check implementation.

While the export addition is functionally correct, consider maintaining alphabetical ordering in the exports list for better maintainability. Additionally, since this function is critical for managing enrichment credits, ensure it's properly integrated across all enrichment workflows.

Let's verify the credit check implementation:

✅ Verification successful

Credit check implementation is properly integrated and working as expected.

The verification shows that:

The hasRemainingCredits function is properly defined in the interface (types.ts) and implemented by all enrichment sources
It's correctly used in the enrichment workflow to check credits before proceeding with enrichment
The implementation includes proper caching mechanism in Redis to avoid frequent API calls
Each source has its own implementation:
- SERP API checks account endpoint
- CrustData checks credits endpoint
- Others have placeholder implementations returning true

The only minor suggestion about alphabetical ordering in exports remains valid, but the credit check implementation is robust and properly integrated.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the credit check implementation and its usage

# Check where hasRemainingCredits is defined and how it's implemented
ast-grep --pattern 'function hasRemainingCredits($_) {
  $$$
}'

# Check where hasRemainingCredits is used in enrichment workflows
rg -A 5 'hasRemainingCredits'

Length of output: 8838

services/apps/members_enrichment_worker/src/workflows/processMemberSources.ts (3)

236-236: LGTM: Consistent attribute sanitization

Good addition of sanitization for LLM-processed attributes, maintaining data consistency with other attribute processing.

268-282: Verify impact on existing organizations

The new identity verification logic could affect organization merging behavior.

297-297: Verify CrustData profile selection impact

The addition of the CrustData profile selection flag affects member updates.

✅ Verification successful

CrustData profile selection flag is correctly implemented

The flag isHighConfidenceSourceSelectedForWorkExperiences (passed as !!crustDataProfileSelected) is used to determine how work experiences should be handled during member updates. When CrustData is selected as the source:

All existing work experiences are deleted (including manually created ones)
All new work experiences from CrustData are created
This indicates high confidence in CrustData's accuracy for work experience data

This implementation is intentional and aligns with the workflow's purpose of preferring CrustData's work experience data when available.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify how the CrustData profile selection flag affects member updates

# Search for the usage of this flag in the update logic
ast-grep --pattern 'updateMemberUsingSquashedPayload.*crustData'

# Look for related CrustData processing logic
rg -A 5 'CrustData.*profile|profile.*CrustData'

Length of output: 116

Script:

#!/bin/bash
# Let's try a broader search to understand the context better

# Search for updateMemberUsingSquashedPayload function definition and usage
ast-grep --pattern 'function updateMemberUsingSquashedPayload'

# Search for CrustData related code
rg -l "CrustData"

# Search for the specific file content to understand the context
cat services/apps/members_enrichment_worker/src/workflows/processMemberSources.ts

Length of output: 10651

Script:

#!/bin/bash
# Let's look for the activity implementation to understand how the flag is used

# Search for updateMemberUsingSquashedPayload implementation
fd -e ts updateMemberUsingSquashedPayload --exec cat {}

# Search for any related configuration or constants
rg -A 5 "updateMemberUsingSquashedPayload.*crustData" services/apps/members_enrichment_worker/src/activities/

Length of output: 171

Script:

#!/bin/bash
# Let's try a different approach to locate the activity implementation

# Look for the activities directory content
ls -R services/apps/members_enrichment_worker/src/activities/

# Try to find the enrichment activity file
cat services/apps/members_enrichment_worker/src/activities/enrichment.ts

Length of output: 33974

coderabbitai

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (1)

services/apps/members_enrichment_worker/src/activities/enrichment.ts (1)

233-233: Consider keeping important state changes at INFO level.

While reducing log verbosity is good, consider keeping important state changes at INFO level for better operational visibility. For example, member attribute updates and reach changes are significant state modifications that operations might want to monitor without enabling DEBUG logging.

Also applies to: 252-252, 284-284, 302-302

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5e2fcb0 and 22121ea.

📒 Files selected for processing (1)

services/apps/members_enrichment_worker/src/activities/enrichment.ts (17 hunks)

coderabbitai · 2024-12-06T09:45:47Z

services/apps/members_enrichment_worker/src/activities/enrichment.ts

+export async function syncMember(memberId: string): Promise<void> {
+  const syncApi = new SearchSyncApiClient({
+    baseUrl: process.env['CROWD_SEARCH_SYNC_API_URL'],
+  })

-  // Convert strings to timestamps, using fallbacks for missing dates
-  const start1 = d1Start ? new Date(d1Start).getTime() : -Infinity
-  const end1 = d1End ? new Date(d1End).getTime() : Infinity
-  const start2 = d2Start ? new Date(d2Start).getTime() : -Infinity
-  const end2 = d2End ? new Date(d2End).getTime() : Infinity
+  await syncApi.triggerMemberSync(memberId, { withAggs: false })
+}
+
+export async function syncOrganization(organizationId: string): Promise<void> {
+  const syncApi = new SearchSyncApiClient({
+    baseUrl: process.env['CROWD_SEARCH_SYNC_API_URL'],
+  })

-  // Periods intersect if one period's start is before other period's end
-  // and that same period's end is after the other period's start
-  return start1 <= end2 && end1 >= start2
+  await syncApi.triggerOrganizationSync(organizationId, undefined, { withAggs: false })


⚠️ Potential issue

Add error handling and environment validation for sync functions.

The sync functions need:

Error handling for API calls

Validation of environment variables

Retry logic for transient failures

Add error handling:

export async function syncMember(memberId: string): Promise<void> { + const apiUrl = process.env['CROWD_SEARCH_SYNC_API_URL'] + if (!apiUrl) { + throw new Error('CROWD_SEARCH_SYNC_API_URL environment variable is not set') + } + const syncApi = new SearchSyncApiClient({ - baseUrl: process.env['CROWD_SEARCH_SYNC_API_URL'], + baseUrl: apiUrl, }) - await syncApi.triggerMemberSync(memberId, { withAggs: false }) + try { + await syncApi.triggerMemberSync(memberId, { withAggs: false }) + } catch (error) { + svc.log.error({ error, memberId }, 'Failed to sync member') + throw error + } }

Committable suggestion skipped: line range outside the PR's diff.

coderabbitai · 2024-12-06T09:45:47Z

services/apps/members_enrichment_worker/src/activities/enrichment.ts

+  isHighConfidenceSourceSelectedForWorkExperiences: boolean,
 ): IWorkExperienceChanges {
  // we delete all the work experiences that were not manually created
-  const toDelete = oldVersion.filter((c) => c.source !== OrganizationSource.UI)
+  let toDelete = oldVersion.filter((c) => c.source !== OrganizationSource.UI)

  const toCreate: IMemberEnrichmentDataNormalizedOrganization[] = []
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  const toUpdate: Map<IMemberOrganizationData, Record<string, any>> = new Map()

+  if (isHighConfidenceSourceSelectedForWorkExperiences) {
+    toDelete = oldVersion
+    toCreate.push(...newVersion)
+    return {
+      toDelete,
+      toCreate,
+      toUpdate,
+    }
+  }


⚠️ Potential issue

Document the high confidence override behavior and add safeguards.

The isHighConfidenceSourceSelectedForWorkExperiences parameter allows overriding all existing work experiences, including manually created ones. This is a critical change that needs:

Documentation explaining when this override should be used

Safeguards to prevent accidental data loss

Logging of override events for audit purposes

Add documentation and logging:

function prepareWorkExperiences( oldVersion: IMemberOrganizationData[], newVersion: IMemberEnrichmentDataNormalizedOrganization[], isHighConfidenceSourceSelectedForWorkExperiences: boolean, ): IWorkExperienceChanges { + // When isHighConfidenceSourceSelectedForWorkExperiences is true, all existing work experiences + // will be replaced with the new version, including manually created ones. This should only be + // used when the source data is highly reliable and complete. + svc.log.info( + { isHighConfidenceSourceSelectedForWorkExperiences }, + 'Processing work experiences with override mode' + )

Committable suggestion skipped: line range outside the PR's diff.

coderabbitai · 2024-12-06T09:45:47Z

services/apps/members_enrichment_worker/src/activities/enrichment.ts

+    const match = orderedNewVersion.find(
      (e) =>
        e.title === current.jobTitle &&
        e.identities &&
        e.identities.some((e) => e.organizationId === current.orgId),
    )
-    if (!match) {
-      // if we didn't find a match by title we should check dates
-      match = orderedNewVersion.find(
-        (e) =>
-          dateIntersects(current.dateStart, current.dateEnd, e.startDate, e.endDate) &&
-          e.identities &&
-          e.identities.some((e) => e.organizationId === current.orgId),
-      )
-    }

    // if we found a match we can check if we need something to update
-    if (match) {
-      // eslint-disable-next-line @typescript-eslint/no-explicit-any
+    if (
+      match &&
+      current.dateStart === match.startDate &&
+      current.dateEnd === null &&
+      match.endDate !== null
+    ) {
      const toUpdateInner: Record<string, any> = {}

-      // lets check if the dates and title are the same otherwise we need to update them
-      if (current.dateStart !== match.startDate) {
-        toUpdateInner.dateStart = match.startDate
-      }
-
-      if (current.dateEnd !== match.endDate) {
-        toUpdateInner.dateEnd = match.endDate
-      }
-
-      if (current.jobTitle !== match.title) {
-        toUpdateInner.title = match.title
-      }
-
-      if (Object.keys(toUpdateInner).length > 0) {
-        toUpdate.set(current, toUpdateInner)
-      }
+      toUpdateInner.dateEnd = match.endDate
+      toUpdate.set(current, toUpdateInner)

      // remove the match from the new version array so we later don't process it again
      orderedNewVersion = orderedNewVersion.filter((e) => e.id !== match.id)
+    } else if (
+      match &&
+      (current.dateStart !== match.startDate || current.dateEnd !== null || match.endDate === null)
+    ) {
+      // there's an incoming work experiences, but it's conflicting with the existing manually updated data
+      // we shouldn't add or update anything when this happens
+      // we can only update dateEnd of existing manually changed data, when it has a null dateEnd
+      orderedNewVersion = orderedNewVersion.filter((e) => e.id !== match.id)


🛠️ Refactor suggestion

Improve error handling for work experience matching.

The work experience matching logic needs better error handling:

Validation of input data structures

Handling of edge cases where identities array is undefined

Logging of matching decisions for debugging

Add error handling:

const match = orderedNewVersion.find( (e) => e.title === current.jobTitle && - e.identities && - e.identities.some((e) => e.organizationId === current.orgId), + Array.isArray(e.identities) && + e.identities.some((identity) => { + if (!identity || !identity.organizationId) { + svc.log.warn({ experience: e }, 'Invalid identity data in work experience') + return false + } + return identity.organizationId === current.orgId + }), ) + svc.log.debug( + { current, match }, + match ? 'Found matching work experience' : 'No matching work experience found' + )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const match = orderedNewVersion.find(

(e) =>

e.title === current.jobTitle &&

e.identities &&

e.identities.some((e) => e.organizationId === current.orgId),

)

if (!match) {

// if we didn't find a match by title we should check dates

match = orderedNewVersion.find(

(e) =>

dateIntersects(current.dateStart, current.dateEnd, e.startDate, e.endDate) &&

e.identities &&

e.identities.some((e) => e.organizationId === current.orgId),

)

}

// if we found a match we can check if we need something to update

if (match) {

// eslint-disable-next-line @typescript-eslint/no-explicit-any

if (

match &&

current.dateStart === match.startDate &&

current.dateEnd === null &&

match.endDate !== null

) {

const toUpdateInner: Record<string, any> = {}

// lets check if the dates and title are the same otherwise we need to update them

if (current.dateStart !== match.startDate) {

toUpdateInner.dateStart = match.startDate

}

if (current.dateEnd !== match.endDate) {

toUpdateInner.dateEnd = match.endDate

}

if (current.jobTitle !== match.title) {

toUpdateInner.title = match.title

}

if (Object.keys(toUpdateInner).length > 0) {

toUpdate.set(current, toUpdateInner)

}

toUpdateInner.dateEnd = match.endDate

toUpdate.set(current, toUpdateInner)

// remove the match from the new version array so we later don't process it again

orderedNewVersion = orderedNewVersion.filter((e) => e.id !== match.id)

} else if (

match &&

(current.dateStart !== match.startDate || current.dateEnd !== null || match.endDate === null)

) {

// there's an incoming work experiences, but it's conflicting with the existing manually updated data

// we shouldn't add or update anything when this happens

// we can only update dateEnd of existing manually changed data, when it has a null dateEnd

orderedNewVersion = orderedNewVersion.filter((e) => e.id !== match.id)

const match = orderedNewVersion.find(

(e) =>

e.title === current.jobTitle &&

Array.isArray(e.identities) &&

e.identities.some((identity) => {

if (!identity || !identity.organizationId) {

svc.log.warn({ experience: e }, 'Invalid identity data in work experience')

return false

}

return identity.organizationId === current.orgId

}),

)

svc.log.debug(

{ current, match },

match ? 'Found matching work experience' : 'No matching work experience found'

)

// if we found a match we can check if we need something to update

if (

match &&

current.dateStart === match.startDate &&

current.dateEnd === null &&

match.endDate !== null

) {

const toUpdateInner: Record<string, any> = {}

toUpdateInner.dateEnd = match.endDate

toUpdate.set(current, toUpdateInner)

// remove the match from the new version array so we later don't process it again

orderedNewVersion = orderedNewVersion.filter((e) => e.id !== match.id)

} else if (

match &&

(current.dateStart !== match.startDate || current.dateEnd !== null || match.endDate === null)

) {

// there's an incoming work experiences, but it's conflicting with the existing manually updated data

// we shouldn't add or update anything when this happens

// we can only update dateEnd of existing manually changed data, when it has a null dateEnd

orderedNewVersion = orderedNewVersion.filter((e) => e.id !== match.id)

coderabbitai · 2024-12-06T09:45:48Z

services/apps/members_enrichment_worker/src/activities/enrichment.ts

+      // try matching member's existing organizations with the new ones
+      // we'll be using displayName, title, dates
      for (const org of squashedPayload.memberOrganizations) {
+        if (!org.organizationId) {
+          // Check if any similar in existing work experiences
+          const existingOrg = existingMemberData.organizations.find((o) => {
+            const incomingOrgStartDate = org.startDate ? new Date(org.startDate) : null
+            const incomingOrgEndDate = org.endDate ? new Date(org.endDate) : null
+            const existingOrgStartDate = o.dateStart ? new Date(o.dateStart) : null
+            const existingOrgEndEndDate = o.dateEnd ? new Date(o.dateEnd) : null
+
+            const isSameStartMonthYear =
+              (!incomingOrgStartDate && !existingOrgStartDate) || // Both start dates are null
+              (incomingOrgStartDate &&
+                existingOrgStartDate &&
+                incomingOrgStartDate.getMonth() === existingOrgStartDate.getMonth() &&
+                incomingOrgStartDate.getFullYear() === existingOrgStartDate.getFullYear())
+
+            const isSameEndMonthYear =
+              (!incomingOrgEndDate && !existingOrgEndEndDate) || // Both end dates are null
+              (incomingOrgEndDate &&
+                existingOrgEndEndDate &&
+                incomingOrgEndDate.getMonth() === existingOrgEndEndDate.getMonth() &&
+                incomingOrgEndDate.getFullYear() === existingOrgEndEndDate.getFullYear())
+
+            return (
+              (o.orgName.toLowerCase().includes(org.name.toLowerCase()) ||
+                org.name.toLowerCase().includes(o.orgName.toLowerCase())) &&
+              ((isSameStartMonthYear && isSameEndMonthYear) || org.title === o.jobTitle)
+            )
+          })
+
+          if (existingOrg) {
+            // Get all orgs with the same name as the current one
+            const matchingOrgs = squashedPayload.memberOrganizations.filter(
+              (otherOrg) => otherOrg.name === org.name,
+            )
+
+            // Set organizationId for all matching orgs
+            for (const matchingOrg of matchingOrgs) {
+              matchingOrg.organizationId = existingOrg.orgId
+            }
+          }
+        }
+      }


🛠️ Refactor suggestion

Refactor organization matching logic for better maintainability.

The organization matching logic is complex and could benefit from:

Extracting date comparison logic into a helper function

Using a more robust name comparison (e.g., Levenshtein distance)

Adding comments explaining the matching criteria

Consider refactoring like this:

+function isSameMonthYear(date1: Date | null, date2: Date | null): boolean { + return (!date1 && !date2) || // Both dates are null + (date1 && date2 && + date1.getMonth() === date2.getMonth() && + date1.getFullYear() === date2.getFullYear()) +} + +function areOrganizationNamesSimilar(name1: string, name2: string): boolean { + // TODO: Implement Levenshtein distance comparison + return name1.toLowerCase().includes(name2.toLowerCase()) || + name2.toLowerCase().includes(name1.toLowerCase()) +} // Check if any similar in existing work experiences const existingOrg = existingMemberData.organizations.find((o) => { const incomingOrgStartDate = org.startDate ? new Date(org.startDate) : null const incomingOrgEndDate = org.endDate ? new Date(org.endDate) : null const existingOrgStartDate = o.dateStart ? new Date(o.dateStart) : null const existingOrgEndEndDate = o.dateEnd ? new Date(o.dateEnd) : null - const isSameStartMonthYear = - (!incomingOrgStartDate && !existingOrgStartDate) || // Both start dates are null - (incomingOrgStartDate && - existingOrgStartDate && - incomingOrgStartDate.getMonth() === existingOrgStartDate.getMonth() && - incomingOrgStartDate.getFullYear() === existingOrgStartDate.getFullYear()) + const isSameStartMonthYear = isSameMonthYear(incomingOrgStartDate, existingOrgStartDate) + const isSameEndMonthYear = isSameMonthYear(incomingOrgEndDate, existingOrgEndEndDate) return ( - (o.orgName.toLowerCase().includes(org.name.toLowerCase()) || - org.name.toLowerCase().includes(o.orgName.toLowerCase())) && + areOrganizationNamesSimilar(o.orgName, org.name) && ((isSameStartMonthYear && isSameEndMonthYear) || org.title === o.jobTitle) ) })

Committable suggestion skipped: line range outside the PR's diff.

…le enriching

…nce optimization

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)

services/apps/members_enrichment_worker/src/activities/enrichment.ts (2)

239-239: Consider using structured logging.

The logging changes from info to debug are consistent, but could benefit from structured logging patterns.

Consider enhancing the logging:

-svc.log.debug({ memberId }, 'Adding to member identities!')
+svc.log.debug({ 
+  memberId,
+  action: 'add_member_identities',
+  identityCount: squashedPayload.identities.length 
+}, 'Adding member identities')

Also applies to: 258-258, 290-290, 308-308

330-351: Enhance organization matching logic with early returns.

The nested organization matching logic could be more maintainable with early returns and clearer variable names.

Consider restructuring:

-      // try matching member's existing organizations with the new ones
-      // we'll be using displayName, title, dates
-      for (const org of squashedPayload.memberOrganizations) {
-        if (!org.organizationId) {
-          // Check if any similar in existing work experiences
-          const existingOrg = existingMemberData.organizations.find((o) =>
-            doesIncomingOrgExistInExistingOrgs(o, org),
-          )
-
-          if (existingOrg) {
-            // Get all orgs with the same name as the current one
-            const matchingOrgs = squashedPayload.memberOrganizations.filter(
-              (otherOrg) => otherOrg.name === org.name,
-            )
-
-            // Set organizationId for all matching orgs
-            for (const matchingOrg of matchingOrgs) {
-              matchingOrg.organizationId = existingOrg.orgId
-            }
-          }
-        }
-      }
+      // Match member's existing organizations with new ones using displayName, title, and dates
+      for (const incomingOrg of squashedPayload.memberOrganizations) {
+        // Skip if organization is already identified
+        if (incomingOrg.organizationId) continue;
+
+        // Find matching existing organization
+        const matchingExistingOrg = existingMemberData.organizations.find(existingOrg =>
+          doesIncomingOrgExistInExistingOrgs(existingOrg, incomingOrg)
+        );
+
+        if (!matchingExistingOrg) continue;
+
+        // Update all organizations with the same name
+        const relatedOrgs = squashedPayload.memberOrganizations.filter(
+          org => org.name === incomingOrg.name
+        );
+        
+        relatedOrgs.forEach(org => {
+          org.organizationId = matchingExistingOrg.orgId;
+        });
+      }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 3a52818 and 82aed4f.

📒 Files selected for processing (4)

services/apps/members_enrichment_worker/src/activities/enrichment.ts (20 hunks)
services/libs/common/src/array.ts (1 hunks)
services/libs/data-access-layer/src/old/apps/members_enrichment_worker/index.ts (3 hunks)
services/libs/types/src/enrichment.ts (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

services/libs/data-access-layer/src/old/apps/members_enrichment_worker/index.ts

🔇 Additional comments (6)

services/libs/types/src/enrichment.ts (2)

3-3: LGTM! Import statement is properly placed and used.

The import of IOrganizationIdentity is correctly placed with other related imports and is utilized within the file.

43-43: LGTM! Interface extension enhances organization identity tracking.

The optional identities property is well-designed:

Uses consistent naming with other interfaces
Properly typed as an array of organization identities
Maintains backward compatibility by being optional

Let's verify the usage of this new property:

✅ Verification successful

Based on the search results, I can now generate the final response:

The optional identities property is well-integrated and actively used across the codebase

The property is properly integrated into the organization-related functionality:

Used in member enrichment worker for organization data processing
Utilized in organization similarity calculations for merge suggestions
Implemented in organization repositories for identity management
Integrated with audit logging for tracking identity changes
Applied in organization sync services for data synchronization

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check how the new identities property is being used across the codebase

# Search for IMemberOrganizationData usage
rg -t typescript "IMemberOrganizationData" -A 5

# Search for specific usage of the identities property
rg -t typescript "organizations.*identities" -A 5

Length of output: 169

Script:

#!/bin/bash
# Let's try without typescript file type flag since it's not recognized
# Search for IMemberOrganizationData interface usage
rg "IMemberOrganizationData" -A 5

# Search for files containing organization identities
rg "organizations.*identities" -A 5

# Also search for direct usage of IOrganizationIdentity
rg "IOrganizationIdentity" -A 5

Length of output: 33257

services/apps/members_enrichment_worker/src/activities/enrichment.ts (4)

3-8: LGTM! Clean import organization.

The imports are well-organized and the new utilities from @crowd/common are properly imported.

934-940: ⚠️ Potential issue

Add error handling for sync API calls.

The syncMember function needs proper error handling and environment validation.

942-947: ⚠️ Potential issue

Add error handling for organization sync API calls.

The syncOrganization function needs proper error handling and environment validation.

Apply similar error handling as suggested for syncMember:

 export async function syncOrganization(organizationId: string): Promise<void> {
+  const apiUrl = process.env['CROWD_SEARCH_SYNC_API_URL']
+  if (!apiUrl) {
+    throw new Error('CROWD_SEARCH_SYNC_API_URL environment variable is not set')
+  }
+
   const syncApi = new SearchSyncApiClient({
-    baseUrl: process.env['CROWD_SEARCH_SYNC_API_URL'],
+    baseUrl: apiUrl,
   })
 
-  await syncApi.triggerOrganizationSync(organizationId, undefined, { withAggs: false })
+  try {
+    await syncApi.triggerOrganizationSync(organizationId, undefined, { withAggs: false })
+  } catch (error) {
+    svc.log.error({ error, organizationId }, 'Failed to sync organization')
+    throw error
+  }
 }

Likely invalid or redundant comment.

439-480: 🛠️ Refactor suggestion

Extract date comparison logic into helper functions.

The doesIncomingOrgExistInExistingOrgs function has complex date comparison logic that could be simplified.

Consider extracting helper functions:

+function getVerifiedPrimaryDomains(identities: any[]): string[] {
+  return identities
+    .filter(i => i.type === OrganizationIdentityType.PRIMARY_DOMAIN && i.verified)
+    .map(i => i.value);
+}
+
+function isSameMonthYear(date1: Date | null, date2: Date | null): boolean {
+  if (!date1 && !date2) return true;
+  if (!date1 || !date2) return false;
+  return date1.getMonth() === date2.getMonth() && 
+         date1.getFullYear() === date2.getFullYear();
+}
+
+function areOrganizationNamesSimilar(name1: string, name2: string): boolean {
+  const n1 = name1.toLowerCase();
+  const n2 = name2.toLowerCase();
+  return n1.includes(n2) || n2.includes(n1);
+}
+
 export function doesIncomingOrgExistInExistingOrgs(
   existingOrg: IMemberOrganizationData,
   incomingOrg: IMemberEnrichmentDataNormalizedOrganization,
 ): boolean {
-  const incomingVerifiedPrimaryDomainIdentityValues = incomingOrg.identities
-    .filter((i) => i.type === OrganizationIdentityType.PRIMARY_DOMAIN && i.verified)
-    .map((i) => i.value)
-
-  const existingVerifiedPrimaryDomainIdentityValues = existingOrg.identities
-    .filter((i) => i.type === OrganizationIdentityType.PRIMARY_DOMAIN && i.verified)
-    .map((i) => i.value)
+  const incomingDomains = getVerifiedPrimaryDomains(incomingOrg.identities);
+  const existingDomains = getVerifiedPrimaryDomains(existingOrg.identities);
 
   const incomingOrgStartDate = incomingOrg.startDate ? new Date(incomingOrg.startDate) : null
   const incomingOrgEndDate = incomingOrg.endDate ? new Date(incomingOrg.endDate) : null
   const existingOrgStartDate = existingOrg.dateStart ? new Date(existingOrg.dateStart) : null
   const existingOrgEndEndDate = existingOrg.dateEnd ? new Date(existingOrg.dateEnd) : null
 
-  const isSameStartMonthYear =
-    (!incomingOrgStartDate && !existingOrgStartDate) ||
-    (incomingOrgStartDate &&
-      existingOrgStartDate &&
-      incomingOrgStartDate.getMonth() === existingOrgStartDate.getMonth() &&
-      incomingOrgStartDate.getFullYear() === existingOrgStartDate.getFullYear())
-
-  const isSameEndMonthYear =
-    (!incomingOrgEndDate && !existingOrgEndEndDate) ||
-    (incomingOrgEndDate &&
-      existingOrgEndEndDate &&
-      incomingOrgEndDate.getMonth() === existingOrgEndEndDate.getMonth() &&
-      incomingOrgEndDate.getFullYear() === existingOrgEndEndDate.getFullYear())
+  const hasSameStartDate = isSameMonthYear(incomingOrgStartDate, existingOrgStartDate);
+  const hasSameEndDate = isSameMonthYear(incomingOrgEndDate, existingOrgEndEndDate);
 
   return (
-    hasIntersection(
-      incomingVerifiedPrimaryDomainIdentityValues,
-      existingVerifiedPrimaryDomainIdentityValues,
-    ) ||
-    ((existingOrg.orgName.toLowerCase().includes(incomingOrg.name.toLowerCase()) ||
-      incomingOrg.name.toLowerCase().includes(existingOrg.orgName.toLowerCase())) &&
-      ((isSameStartMonthYear && isSameEndMonthYear) || incomingOrg.title === existingOrg.jobTitle))
+    hasIntersection(incomingDomains, existingDomains) ||
+    (areOrganizationNamesSimilar(existingOrg.orgName, incomingOrg.name) &&
+     ((hasSameStartDate && hasSameEndDate) || incomingOrg.title === existingOrg.jobTitle))
   )
 }

Likely invalid or redundant comment.

coderabbitai · 2024-12-06T17:17:06Z

services/libs/common/src/array.ts

+export const hasIntersection = (arr1: string[], arr2: string[]): boolean => {
+  const set1 = new Set(arr1)
+  return arr2.some((item) => set1.has(item))
+}


🛠️ Refactor suggestion

Add input validation and edge case handling.

The hasIntersection function implementation is efficient using Set, but could be more robust with additional checks.

Consider this enhanced implementation:

-export const hasIntersection = (arr1: string[], arr2: string[]): boolean => { - const set1 = new Set(arr1) - return arr2.some((item) => set1.has(item)) -} +export const hasIntersection = (arr1: string[], arr2: string[]): boolean => { + // Validate inputs + if (!Array.isArray(arr1) || !Array.isArray(arr2)) { + throw new TypeError('Both arguments must be arrays') + } + + // Handle empty arrays + if (arr1.length === 0 || arr2.length === 0) { + return false + } + + // Create Set from the smaller array for better performance + const [smallerArr, largerArr] = arr1.length < arr2.length ? [arr1, arr2] : [arr2, arr1] + const set = new Set(smallerArr.map(item => item?.toLowerCase())) + + // Check for intersection with case-insensitive comparison + return largerArr.some(item => item && set.has(item.toLowerCase())) +}

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

export const hasIntersection = (arr1: string[], arr2: string[]): boolean => {

const set1 = new Set(arr1)

return arr2.some((item) => set1.has(item))

}

export const hasIntersection = (arr1: string[], arr2: string[]): boolean => {

// Validate inputs

if (!Array.isArray(arr1) || !Array.isArray(arr2)) {

throw new TypeError('Both arguments must be arrays')

}

// Handle empty arrays

if (arr1.length === 0 || arr2.length === 0) {

return false

}

// Create Set from the smaller array for better performance

const [smallerArr, largerArr] = arr1.length < arr2.length ? [arr1, arr2] : [arr2, arr1]

const set = new Set(smallerArr.map(item => item?.toLowerCase()))

// Check for intersection with case-insensitive comparison

return largerArr.some(item => item && set.has(item.toLowerCase()))

}

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

services/apps/members_enrichment_worker/src/activities/enrichment.ts (2)

445-485: Refactor date comparison logic for better maintainability

The date comparison logic is duplicated and could be extracted into a helper function for better maintainability.

Consider this refactoring:

+function isSameMonthYear(date1: Date | null, date2: Date | null): boolean {
+  return (!date1 && !date2) || // Both dates are null
+    (date1 && date2 &&
+      date1.getMonth() === date2.getMonth() &&
+      date1.getFullYear() === date2.getFullYear())
+}

 export function doesIncomingOrgExistInExistingOrgs(
   existingOrg: IMemberOrganizationData,
   incomingOrg: IMemberEnrichmentDataNormalizedOrganization,
 ): boolean {
   // ... existing identity filtering code ...

   const incomingOrgStartDate = incomingOrg.startDate ? new Date(incomingOrg.startDate) : null
   const incomingOrgEndDate = incomingOrg.endDate ? new Date(incomingOrg.endDate) : null
   const existingOrgStartDate = existingOrg.dateStart ? new Date(existingOrg.dateStart) : null
   const existingOrgEndEndDate = existingOrg.dateEnd ? new Date(existingOrg.dateEnd) : null

-  const isSameStartMonthYear =
-    (!incomingOrgStartDate && !existingOrgStartDate) || // Both start dates are null
-    (incomingOrgStartDate &&
-      existingOrgStartDate &&
-      incomingOrgStartDate.getMonth() === existingOrgStartDate.getMonth() &&
-      incomingOrgStartDate.getFullYear() === existingOrgStartDate.getFullYear())
+  const isSameStartMonthYear = isSameMonthYear(incomingOrgStartDate, existingOrgStartDate)
+  const isSameEndMonthYear = isSameMonthYear(incomingOrgEndDate, existingOrgEndEndDate)

-  const isSameEndMonthYear =
-    (!incomingOrgEndDate && !existingOrgEndEndDate) || // Both end dates are null
-    (incomingOrgEndDate &&
-      existingOrgEndEndDate &&
-      incomingOrgEndDate.getMonth() === existingOrgEndEndDate.getMonth() &&
-      incomingOrgEndDate.getFullYear() === existingOrgEndEndDate.getFullYear())

   return (
     hasIntersection(
       incomingVerifiedPrimaryDomainIdentityValues,
       existingVerifiedPrimaryDomainIdentityValues,
     ) ||
     ((existingOrg.orgName.toLowerCase().includes(incomingOrg.name.toLowerCase()) ||
       incomingOrg.name.toLowerCase().includes(existingOrg.orgName.toLowerCase())) &&
       ((isSameStartMonthYear && isSameEndMonthYear) || incomingOrg.title === existingOrg.jobTitle))
   )
 }

Line range hint 847-925: Consider splitting the work experience preparation logic

The prepareWorkExperiences function is becoming complex with multiple responsibilities. Consider splitting it into smaller, focused functions.

Suggested structure:

interface IWorkExperiencePreparation {
  processHighConfidenceSource(): IWorkExperienceChanges;
  processManualChanges(): IWorkExperienceChanges;
  processNewExperiences(): IWorkExperienceChanges;
}

class WorkExperiencePreparation implements IWorkExperiencePreparation {
  constructor(
    private oldVersion: IMemberOrganizationData[],
    private newVersion: IMemberEnrichmentDataNormalizedOrganization[],
    private isHighConfidenceSource: boolean
  ) {}

  prepare(): IWorkExperienceChanges {
    if (this.isHighConfidenceSource) {
      return this.processHighConfidenceSource();
    }
    return this.processRegularSource();
  }

  private processRegularSource(): IWorkExperienceChanges {
    const manualChanges = this.processManualChanges();
    const newExperiences = this.processNewExperiences();
    return this.mergeChanges(manualChanges, newExperiences);
  }
  
  // ... implement other methods
}

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b8d5d2b and 7318e16.

📒 Files selected for processing (1)

services/apps/members_enrichment_worker/src/activities/enrichment.ts (19 hunks)

🔇 Additional comments (5)

services/apps/members_enrichment_worker/src/activities/enrichment.ts (5)

3-8: LGTM!

The new imports are properly aligned with the added functionality and follow the module's organization pattern.

239-239: LGTM: Logging level adjustments

The change from info to debug level is appropriate for these operational logs.

Also applies to: 258-258, 290-290, 308-308

940-953: ⚠️ Potential issue

Add error handling and environment validation for sync functions

The sync functions need proper error handling and environment validation.

Apply this diff:

+function validateSyncApiUrl(): string {
+  const apiUrl = process.env['CROWD_SEARCH_SYNC_API_URL'];
+  if (!apiUrl) {
+    throw new Error('CROWD_SEARCH_SYNC_API_URL environment variable is not set');
+  }
+  return apiUrl;
+}

 export async function syncMember(memberId: string): Promise<void> {
+  const apiUrl = validateSyncApiUrl();
   const syncApi = new SearchSyncApiClient({
-    baseUrl: process.env['CROWD_SEARCH_SYNC_API_URL'],
+    baseUrl: apiUrl,
   })

-  await syncApi.triggerMemberSync(memberId, { withAggs: false })
+  try {
+    await syncApi.triggerMemberSync(memberId, { withAggs: false })
+  } catch (error) {
+    svc.log.error({ error, memberId }, 'Failed to sync member')
+    throw error
+  }
 }

 export async function syncOrganization(organizationId: string): Promise<void> {
+  const apiUrl = validateSyncApiUrl();
   const syncApi = new SearchSyncApiClient({
-    baseUrl: process.env['CROWD_SEARCH_SYNC_API_URL'],
+    baseUrl: apiUrl,
   })

-  await syncApi.triggerOrganizationSync(organizationId, undefined, { withAggs: false })
+  try {
+    await syncApi.triggerOrganizationSync(organizationId, undefined, { withAggs: false })
+  } catch (error) {
+    svc.log.error({ error, organizationId }, 'Failed to sync organization')
+    throw error
+  }
 }

426-438: ⚠️ Potential issue

Add error handling for sync operations

The sync operations after member updates lack error handling. A failure in sync operations could leave the system in an inconsistent state.

Apply this diff:

-    await Promise.all(promises)
+    try {
+      await Promise.all(promises)
+    } catch (error) {
+      svc.log.error({ error, memberId }, 'Failed to process member updates')
+      throw error
+    }

     if (updated) {
       await setMemberEnrichmentUpdateDateDb(tx.transaction(), memberId)
-      await syncMember(memberId)
+      try {
+        await syncMember(memberId)
+      } catch (error) {
+        svc.log.error({ error, memberId }, 'Failed to sync member')
+        throw error
+      }
     } else {
       await setMemberEnrichmentTryDateDb(tx.transaction(), memberId)
     }

-    for (const orgId of orgIdsToSync) {
-      await syncOrganization(orgId)
+    try {
+      await Promise.all(orgIdsToSync.map(orgId => syncOrganization(orgId)))
+    } catch (error) {
+      svc.log.error({ error, orgIds: orgIdsToSync }, 'Failed to sync organizations')
+      throw error
+    }

Likely invalid or redundant comment.

230-231: Verify high confidence source handling

The new parameter isHighConfidenceSourceSelectedForWorkExperiences has significant impact on work experience data. Let's verify its usage across the codebase.

✅ Verification successful

Parameter usage is properly implemented and contained

The isHighConfidenceSourceSelectedForWorkExperiences parameter is used correctly within a well-defined scope:

It's passed from updateMemberUsingSquashedPayload to prepareWorkExperiences
In prepareWorkExperiences, it controls whether to completely replace work experiences (when true) or preserve manually created ones (when false)
The implementation is self-contained and doesn't require changes in other parts of the codebase

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check how isHighConfidenceSourceSelectedForWorkExperiences is used
# Expected: Find all calls to updateMemberUsingSquashedPayload to verify proper parameter passing

rg -A 5 "updateMemberUsingSquashedPayload.*isHighConfidenceSourceSelectedForWorkExperiences"

Length of output: 94

Script:

#!/bin/bash
# Let's first find the file containing the function definition
fd "enrichment.ts" --exec ast-grep --pattern 'function updateMemberUsingSquashedPayload($_) {
  $$$
}'

# Also search for any references to this parameter
rg "isHighConfidenceSourceSelectedForWorkExperiences"

Length of output: 670

Script:

#!/bin/bash
# Let's see the context around these usages to understand the implementation
rg -B 5 -A 10 "isHighConfidenceSourceSelectedForWorkExperiences" services/apps/members_enrichment_worker/src/activities/enrichment.ts

Length of output: 2094

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

services/libs/opensearch/src/service/organization.sync.service.ts (1)
417-422: Add error logging for skipped organizations

The null check is a good addition to prevent processing undefined organizations. However, consider adding error logging when an organization is skipped to help with debugging.
 if (base) {
   const data = await buildFullOrgForMergeSuggestions(qx, base)
   const prefixed = OrganizationSyncService.prefixData(data)
   await this.openSearchService.index(orgId, OpenSearchIndex.ORGANIZATIONS, prefixed)
+} else {
+  this.log.warn({ orgId }, 'Organization not found, skipping OpenSearch indexing')
 }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7318e16 and 47f9263.

📒 Files selected for processing (1)

services/libs/opensearch/src/service/organization.sync.service.ts (1 hunks)

🔇 Additional comments (1)

services/libs/opensearch/src/service/organization.sync.service.ts (1)

417-422: Verify member enrichment impact

Since this change is part of member enrichment tweaks, let's verify that skipping non-existent organizations doesn't affect member processing.

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)

services/apps/members_enrichment_worker/src/activities/enrichment.ts (2)
853-854: Document the high confidence source parameter.

The isHighConfidenceSourceSelectedForWorkExperiences parameter needs documentation explaining its purpose and impact on work experience processing.

Add JSDoc comment:
+/**
+ * Prepares work experiences for update by comparing old and new versions.
+ * @param oldVersion - Existing work experiences in the database
+ * @param newVersion - New work experiences from enrichment
+ * @param isHighConfidenceSourceSelectedForWorkExperiences - When true, replaces all existing work experiences with new ones
+ * @returns Object containing work experiences to delete, create, and update
+ */
 function prepareWorkExperiences(
919-920: Improve type safety in work experience updates.

Using Record<string, any> reduces type safety. Consider creating a specific interface for work experience updates.
+interface IWorkExperienceUpdate {
+  dateEnd?: string | null;
+  // add other possible update fields
+}

-const toUpdateInner: Record<string, any> = {}
+const toUpdateInner: IWorkExperienceUpdate = {}

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 47f9263 and 1606a84.

📒 Files selected for processing (1)

services/apps/members_enrichment_worker/src/activities/enrichment.ts (19 hunks)

🔇 Additional comments (3)

services/apps/members_enrichment_worker/src/activities/enrichment.ts (3)

3-8: LGTM: Import changes are well-structured.

The new imports from @crowd/common are properly added and utilized throughout the code.

946-959: ⚠️ Potential issue

Add error handling and environment validation for sync functions.

The sync functions need proper error handling and environment variable validation.

This issue was previously identified in a past review. The same fixes for error handling and environment validation should be applied here.

239-239: Verify logging visibility after level changes.

The change from info to debug level is appropriate for these operational logs. However, ensure that these events are still adequately visible for troubleshooting when needed.

Also applies to: 258-258, 290-290, 308-308

✅ Verification successful

Based on the log level usage patterns shown in the codebase, I can now generate the final response:

Log level changes align with codebase patterns

The change from info to debug for these operational logs is consistent with the codebase's logging patterns where:

debug is used for detailed operational tracing (e.g., query execution, message processing)
info is used for significant state changes and important operations
warn is used for concerning but non-critical issues
error is used for failures and critical issues

The changed logs are appropriately set to debug as they trace routine member identity processing operations, similar to other debug-level operational logs in the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for other log levels in the codebase to ensure consistent logging patterns
rg -A 1 "log\.(info|debug|warn|error)" --type ts

Length of output: 70664

coderabbitai · 2024-12-07T12:31:31Z

services/apps/members_enrichment_worker/src/activities/enrichment.ts

+                orgIdsToSync.map((orgId) =>
+                  syncOrganization(orgId).catch((error) => {
+                    console.error(`Failed to sync organization with ID ${orgId}:`, error)
+                  }),
+                ),
+              ),
+            ),


⚠️ Potential issue

Improve error handling in organization sync operations.

The current error handling has several issues:

Uses console.error instead of the service logger (svc.log.error)

Silently catches errors without proper error propagation

Apply this fix:

orgIdsToSync.map((orgId) => - syncOrganization(orgId).catch((error) => { - console.error(`Failed to sync organization with ID ${orgId}:`, error) - }), + syncOrganization(orgId).catch((error) => { + svc.log.error({ error, orgId }, 'Failed to sync organization') + throw error + }),

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

orgIdsToSync.map((orgId) =>

syncOrganization(orgId).catch((error) => {

console.error(`Failed to sync organization with ID ${orgId}:`, error)

}),

),

),

),

orgIdsToSync.map((orgId) =>

syncOrganization(orgId).catch((error) => {

svc.log.error({ error, orgId }, 'Failed to sync organization')

throw error

}),

),

),

),

…'s already a profile from verified

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

services/apps/members_enrichment_worker/src/activities/enrichment.ts (1)

453-494: Refactor organization matching logic for better maintainability.

The organization matching logic is complex and could benefit from:

Extracting date comparison logic into helper functions
Adding comments explaining the matching criteria

Consider this refactor:

+function isSameMonthYear(date1: Date | null, date2: Date | null): boolean {
+  return (!date1 && !date2) || // Both dates are null
+    (date1 && date2 &&
+      date1.getMonth() === date2.getMonth() &&
+      date1.getFullYear() === date2.getFullYear())
+}

 export function doesIncomingOrgExistInExistingOrgs(
   existingOrg: IMemberOrganizationData,
   incomingOrg: IMemberEnrichmentDataNormalizedOrganization,
 ): boolean {
   // Extract domain identities for comparison
   const incomingVerifiedPrimaryDomainIdentityValues = incomingOrg.identities
     .filter((i) => i.type === OrganizationIdentityType.PRIMARY_DOMAIN && i.verified)
     .map((i) => i.value)

   const existingVerifiedPrimaryDomainIdentityValues = existingOrg.identities
     .filter((i) => i.type === OrganizationIdentityType.PRIMARY_DOMAIN && i.verified)
     .map((i) => i.value)

   const incomingOrgStartDate = incomingOrg.startDate ? new Date(incomingOrg.startDate) : null
   const incomingOrgEndDate = incomingOrg.endDate ? new Date(incomingOrg.endDate) : null
   const existingOrgStartDate = existingOrg.dateStart ? new Date(existingOrg.dateStart) : null
   const existingOrgEndEndDate = existingOrg.dateEnd ? new Date(existingOrg.dateEnd) : null

-  const isSameStartMonthYear =
-    (!incomingOrgStartDate && !existingOrgStartDate) || // Both start dates are null
-    (incomingOrgStartDate &&
-      existingOrgStartDate &&
-      incomingOrgStartDate.getMonth() === existingOrgStartDate.getMonth() &&
-      incomingOrgStartDate.getFullYear() === existingOrgStartDate.getFullYear())
+  const isSameStartMonthYear = isSameMonthYear(incomingOrgStartDate, existingOrgStartDate)
+  const isSameEndMonthYear = isSameMonthYear(incomingOrgEndDate, existingOrgEndEndDate)

   return (
     hasIntersection(
       incomingVerifiedPrimaryDomainIdentityValues,
       existingVerifiedPrimaryDomainIdentityValues,
     ) ||
     ((existingOrg.orgName.toLowerCase().includes(incomingOrg.name.toLowerCase()) ||
       incomingOrg.name.toLowerCase().includes(existingOrg.orgName.toLowerCase())) &&
       ((isSameStartMonthYear && isSameEndMonthYear) || incomingOrg.title === existingOrg.jobTitle))
   )
 }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c87f732 and 6ae3a91.

📒 Files selected for processing (1)

services/apps/members_enrichment_worker/src/activities/enrichment.ts (18 hunks)

🔇 Additional comments (3)

services/apps/members_enrichment_worker/src/activities/enrichment.ts (3)

Line range hint 3-46: LGTM: Import statements are properly organized.

The new imports are correctly added and are used throughout the code.

381-387: ⚠️ Potential issue

Improve error handling in organization sync operations.

The current error handling has several issues:

Uses console.error instead of the service logger
Silently catches errors without proper error propagation

Apply this fix:

 orgIdsToSync.map((orgId) =>
-  syncOrganization(orgId).catch((error) => {
-    console.error(`Failed to sync organization with ID ${orgId}:`, error)
-  }),
+  syncOrganization(orgId).catch((error) => {
+    svc.log.error({ error, orgId }, 'Failed to sync organization')
+    throw error
+  }),

Likely invalid or redundant comment.

953-966: ⚠️ Potential issue

Add error handling and environment validation for sync functions.

The sync functions need:

Error handling for API calls
Validation of environment variables

Add error handling:

 export async function syncMember(memberId: string): Promise<void> {
+  const apiUrl = process.env['CROWD_SEARCH_SYNC_API_URL']
+  if (!apiUrl) {
+    throw new Error('CROWD_SEARCH_SYNC_API_URL environment variable is not set')
+  }
+
   const syncApi = new SearchSyncApiClient({
-    baseUrl: process.env['CROWD_SEARCH_SYNC_API_URL'],
+    baseUrl: apiUrl,
   })

-  await syncApi.triggerMemberSync(memberId, { withAggs: false })
+  try {
+    await syncApi.triggerMemberSync(memberId, { withAggs: false })
+  } catch (error) {
+    svc.log.error({ error, memberId }, 'Failed to sync member')
+    throw error
+  }
 }

 export async function syncOrganization(organizationId: string): Promise<void> {
+  const apiUrl = process.env['CROWD_SEARCH_SYNC_API_URL']
+  if (!apiUrl) {
+    throw new Error('CROWD_SEARCH_SYNC_API_URL environment variable is not set')
+  }
+
   const syncApi = new SearchSyncApiClient({
-    baseUrl: process.env['CROWD_SEARCH_SYNC_API_URL'],
+    baseUrl: apiUrl,
   })

-  await syncApi.triggerOrganizationSync(organizationId, undefined, { withAggs: false })
+  try {
+    await syncApi.triggerOrganizationSync(organizationId, undefined, { withAggs: false })
+  } catch (error) {
+    svc.log.error({ error, organizationId }, 'Failed to sync organization')
+    throw error
+  }
 }

Likely invalid or redundant comment.

epipav added 25 commits November 28, 2024 16:05

add some logging

c50ab0d

log more

e66da1e

test new version

eace62d

only update manual changes when dateEnd is null

8a4bb54

profile finder and work expfixes

5473938

simplify a bit

1175d84

clean incoming attributes from llm

ca1f8d4

fix

8faed6c

update log levels

23e6cdb

fix prog normalization

00c7dc2

fix serp linkedin handle parse

21c76ba

process all members

c0abfa3

when primary domain exists don't send the additional identities as ve…

763480f

…rified to prevent multiple org matches

improve progai work exp normalization

a36056c

better replaceDoubleQuotes

6c19087

Merge branch 'main' into bugfix/check-org-existence-in-enrichment-fix

6c4750d

some logging

e8f7618

remove logging

d59ac6b

org identity squash fix

31ee17b

fix

98105dd

fix for identities inside workExperiencesSquashedByLLM

d4e7bb0

bit more fixing

73add51

use replaceDoubleQuotes on work experience free text fields

ab189a6

switch to sonnet 3.5v2 because of longer output token support

cf6f43a

final fixes, some cleaning

d8b9762

epipav requested a review from themarolt December 2, 2024 08:40

epipav added 2 commits December 2, 2024 09:43

remove unused code

4c24a9b

Merge branch 'main' into bugfix/check-org-existence-in-enrichment-fix

3b5c48a

epipav marked this pull request as ready for review December 2, 2024 09:06

epipav added 2 commits December 2, 2024 10:08

formatting

d8a63b4

Merge branch 'bugfix/check-org-existence-in-enrichment-fix' of github…

9fac18e

….com:CrowdDotDev/crowd.dev into bugfix/check-org-existence-in-enrichment-fix

coderabbitai bot reviewed Dec 2, 2024

View reviewed changes

fix linting

64aee6c

coderabbitai bot reviewed Dec 2, 2024

View reviewed changes

fix progai linkedin normalization

a859861

coderabbitai bot reviewed Dec 3, 2024

View reviewed changes

checking existent work experiences when mapping incoming new work exp…

5e2fcb0

…eriences from enrichment

coderabbitai bot reviewed Dec 5, 2024

View reviewed changes

sync member and created orgs on enrchment to opensearch

22121ea

coderabbitai bot reviewed Dec 6, 2024

View reviewed changes

epipav added 3 commits December 6, 2024 12:11

also return unverified identities on getting existing member data whi…

52549a0

…le enriching

check org existence in all platforms if the type is primary domain

3a52818

cross-checking verified domains in existing & incoming org for existe…

82aed4f

…nce optimization

coderabbitai bot reviewed Dec 6, 2024

View reviewed changes

epipav added 3 commits December 6, 2024 18:18

fix returning identities within existing work exps

71a297d

fix consistenct issue when syncing bcs of Promise.all

b8d5d2b

fix

7318e16

coderabbitai bot reviewed Dec 6, 2024

View reviewed changes

check sent orgid exists

47f9263

coderabbitai bot reviewed Dec 6, 2024

View reviewed changes

sync fix

1606a84

coderabbitai bot reviewed Dec 7, 2024

View reviewed changes

epipav added 2 commits December 7, 2024 13:31

fix

c87f732

update linkedin discarding logic to discard all unverified when there…

6ae3a91

…'s already a profile from verified

coderabbitai bot reviewed Dec 9, 2024

View reviewed changes

epipav merged commit 157de36 into main Dec 10, 2024
8 checks passed

epipav deleted the bugfix/check-org-existence-in-enrichment-fix branch December 10, 2024 08:29

-export const hasIntersection = (arr1: string[], arr2: string[]): boolean => {
-  const set1 = new Set(arr1)
-  return arr2.some((item) => set1.has(item))
-}
+export const hasIntersection = (arr1: string[], arr2: string[]): boolean => {
+  // Validate inputs
+  if (!Array.isArray(arr1) || !Array.isArray(arr2)) {
+    throw new TypeError('Both arguments must be arrays')
+  }
+  // Handle empty arrays
+  if (arr1.length === 0 || arr2.length === 0) {
+    return false
+  }
+  // Create Set from the smaller array for better performance
+  const [smallerArr, largerArr] = arr1.length < arr2.length ? [arr1, arr2] : [arr2, arr1]
+  const set = new Set(smallerArr.map(item => item?.toLowerCase()))
+  // Check for intersection with case-insensitive comparison
+  return largerArr.some(item => item && set.has(item.toLowerCase()))
+}

Final tweaks to member enrichment #2704

Final tweaks to member enrichment #2704

Uh oh!

Conversation

epipav commented Dec 2, 2024 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes proposed ✍️

What

Why

How

Checklist ✅

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Dec 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Possibly related PRs

Suggested labels

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 2, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 3, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 6, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 6, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 6, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 6, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 6, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 7, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

epipav commented Dec 2, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 2, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)