
@trueleo
Contributor

@trueleo trueleo commented Aug 24, 2022

Description

This PR refactors the code around StorageSync, switches to PathBuf in places where it is the more appropriate type for path handling, and changes the filename so that it includes the hostname for identification.

StorageSync is only ever required for the local sync cycle. The S3 sync currently checks the top-level folders inside the local data directory to retrieve stream names and their respective paths. This is not ideal, so it is changed to go only through the streams returned by 'list_streams' in the S3 storage. This partially fixes #54, but there should be more checks in place when loading streams.

Changes

  • Refactors StorageSync
  • Replaces DirName with StorageDir for path-related handling
  • Uses StorageDir in the S3 sync
  • Only checks for files at depth 1 when walking the tmp directory
  • Uses the hostname in the filename instead of a random string
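
To illustrate the "depth 1" item above, here is a minimal sketch using only the standard library. It is not the code from this PR: the function name `files_at_depth_one` is hypothetical, and the PR may well use a different directory walker. It only shows the idea of listing files directly inside a directory without descending into subdirectories.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: returns the regular files that sit directly inside
/// `dir`. Subdirectories are skipped rather than walked, so nested files
/// are never returned.
pub fn files_at_depth_one(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        // `is_file` is false for directories, so nothing below depth 1 is collected.
        if path.is_file() {
            files.push(path);
        }
    }
    // `read_dir` gives no ordering guarantee; sort for a stable result.
    files.sort();
    Ok(files)
}
```

Errors from reading the directory are returned to the caller instead of being unwrapped.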

This PR has:

  • been tested to ensure log ingestion and log query works.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Member

@nitisht nitisht left a comment

some comments


pub fn list_streams(&self) -> Vec<String> {
self.read().unwrap().keys().map(String::clone).collect()
}
Member

Let's not add an unwrap here.

Contributor Author

Unwrapping here is fine because the same thread does not hold any lock. Some measures can be considered to handle poisoning consistently across all the places where a read is needed.
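
One possible measure for the poisoning case, sketched under assumptions: `StreamMap` and its `u64` value type are hypothetical stand-ins for the real metadata map, which is not shown in this thread. The sketch recovers the guard from a poisoned lock with `PoisonError::into_inner` instead of panicking. Whether reading possibly inconsistent data after a panic is acceptable is a design decision for the project.

```rust
use std::collections::HashMap;
use std::sync::{PoisonError, RwLock};

/// Hypothetical stand-in for the stream metadata map discussed above.
pub struct StreamMap(RwLock<HashMap<String, u64>>);

impl StreamMap {
    pub fn new() -> Self {
        StreamMap(RwLock::new(HashMap::new()))
    }

    pub fn insert(&self, name: &str) {
        // If another thread panicked while holding the lock, take the guard anyway.
        let mut guard = self.0.write().unwrap_or_else(PoisonError::into_inner);
        guard.insert(name.to_string(), 0);
    }

    pub fn list_streams(&self) -> Vec<String> {
        // Same recovery on the read side, so this call never panics on poisoning.
        let guard = self.0.read().unwrap_or_else(PoisonError::into_inner);
        guard.keys().cloned().collect()
    }
}
```

Keeping the recovery inside the accessor methods means callers do not need their own `unwrap`.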

impl Opt {
pub fn get_cache_path(&self, stream_name: &str) -> String {
format!("{}/{}", self.local_disk_path, stream_name)
pub fn get_cache_path(&self, stream_name: &str) -> PathBuf {
Member

This function and the one below return exactly the same thing. Let's remove one of them.
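
For reference, a single `PathBuf`-returning accessor could look roughly like this. It is a sketch only: the `Opt` struct here is reduced to one field, and typing `local_disk_path` as `PathBuf` is an assumption, since the real field type is not visible in this hunk.

```rust
use std::path::PathBuf;

/// Reduced, hypothetical version of `Opt` with only the field needed here.
pub struct Opt {
    pub local_disk_path: PathBuf,
}

impl Opt {
    /// Builds the per-stream cache directory with `join`, which inserts the
    /// platform's path separator instead of a hard-coded "/".
    pub fn get_cache_path(&self, stream_name: &str) -> PathBuf {
        self.local_disk_path.join(stream_name)
    }
}
```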

}
let filename = file.file_name().unwrap().to_str().unwrap();
let file_suffix = str::replacen(filename, ".", "/", 3);
let s3_path = format!("{}/{}", stream, file_suffix);
Member

Let's use the join approach everywhere.

Contributor Author

The S3 path is the key for the S3 put-object call. It does not represent any local file; it is generated from the filenames of the files that are yet to be synced. It is fine for it to be a string.

Member

right
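
As a small illustration of the key derivation discussed in this thread, the sketch below wraps the same `replacen` call from the diff in a standalone function. The function name `s3_key` and the sample filename used in the note that follows are hypothetical; the actual filename layout is not shown here.

```rust
/// Hypothetical helper: turns a flat local filename into an object key by
/// replacing the first three dots with "/" and prefixing the stream name.
/// The result is a plain string, since it names an S3 object, not a local path.
pub fn s3_key(stream: &str, filename: &str) -> String {
    let file_suffix = filename.replacen('.', "/", 3);
    format!("{}/{}", stream, file_suffix)
}
```

With a made-up filename such as `a.b.c.d.parquet` and stream `app`, this yields `app/a/b/c/d.parquet`: only the first three dots become separators, and the remaining dots are left alone.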

);
}
let filename = file.file_name().unwrap().to_str().unwrap();
let file_suffix = str::replacen(filename, ".", "/", 3);
Member

Let's avoid adding new unwraps.

Contributor Author

@trueleo trueleo Aug 24, 2022

The filename is a valid UTF-8 filename generated when the local sync happens. The only way this could fail is if someone created an invalid file in the tmp directory.

Member

Someone could try this, we can't underestimate our users :)
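
A minimal sketch of how the two unwraps could be avoided, assuming it is acceptable to skip files whose names cannot be read as UTF-8. The helper name `utf8_file_name` is hypothetical and is not part of this PR.

```rust
use std::path::Path;

/// Hypothetical helper: returns the final path component as `&str`, or `None`
/// when the path has no file name or the name is not valid UTF-8.
pub fn utf8_file_name(file: &Path) -> Option<&str> {
    // `?` propagates the missing-file-name case; `to_str` covers invalid UTF-8.
    file.file_name()?.to_str()
}
```

A caller in the sync loop could then log and `continue` on `None` instead of panicking on an unexpected file in the tmp directory.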

@trueleo trueleo marked this pull request as ready for review August 25, 2022 08:50
@nitisht nitisht merged commit 8c4656c into parseablehq:main Aug 25, 2022
@trueleo trueleo deleted the hostname branch August 25, 2022 11:20
nitisht added a commit that referenced this pull request Aug 27, 2022
This adds multiple improvements:

- Datafusion querying multiple prefixes
- Avoid dependency on external plugins for S3 connection
- Ensure using complete file path inside each prefix (added in #55) to avoid listing calls

Co-authored-by: Satyam Singh <[email protected]>

Development

Successfully merging this pull request may close these issues.

Verify valid streams on startup and make sure stream is properly deleted

2 participants