
Zabuzard
Member

@Zabuzard Zabuzard commented Oct 9, 2022

Overview

Implements and closes #621.

This PR does two things:

  • it adds code commands; the first one is format
  • it reworks the formatter project

Code commands

UX

Once the bot detects code, it replies with a message that offers buttons:

[image: code detected]

If a button is clicked, the bot attaches the result as an embed to its message and disables that button:

[image: clicked]

Additionally, the reply updates itself whenever the original message is edited:

[image: edited]

It is also deleted automatically if the original message is deleted.

Extension

The code is written generically, so adding new code actions is simple. Here is an example with some mock commands:

[image: multiple commands]

All that has to be done is to implement a simple interface:

[image: interface]

and add it to a list in CodeMessageHandler.
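The interface itself only appears as a screenshot in the PR, so the following is just a rough sketch of what such an extension point could look like. The method names getLabel and apply, the mock CountLinesAction and the CodeActionRegistry helper are assumptions made for illustration; only the names CodeAction and labelToCodeAction appear in the PR, and the real action presumably returns an embed rather than a String.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shape of a code action; the real interface in the PR may differ.
interface CodeAction {
    /** Text shown on the button. */
    String getLabel();

    /** Produces the result for the detected code. */
    String apply(String code);
}

// Mock action, for illustration only.
final class CountLinesAction implements CodeAction {
    @Override
    public String getLabel() {
        return "Count lines";
    }

    @Override
    public String apply(String code) {
        return code.split("\n", -1).length + " line(s)";
    }
}

// Hypothetical helper standing in for the list kept in CodeMessageHandler.
final class CodeActionRegistry {
    private CodeActionRegistry() {}

    /** Indexes actions by label, keeping registration order for the button row. */
    static Map<String, CodeAction> byLabel(List<CodeAction> actions) {
        Map<String, CodeAction> labelToCodeAction = new LinkedHashMap<>();
        for (CodeAction action : actions) {
            labelToCodeAction.put(action.getLabel(), action);
        }
        return labelToCodeAction;
    }
}
```

With a layout like this, registering a new action comes down to adding one more element to the list passed to byLabel.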

Code detection

The feature activates on messages that contain code. Code is detected by the presence of code blocks (fences).

This matching is done without regex for performance reasons, since the check runs on every single posted message.
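A regex-free fence check can be as small as two indexOf calls. This is a minimal sketch and not the code from the PR; the class name CodeFenceDetector and the rule "an opening fence followed by a closing fence" are assumptions.

```java
final class CodeFenceDetector {
    // Three backticks, built via repeat so this snippet does not contain a literal fence.
    private static final String FENCE = "`".repeat(3);

    private CodeFenceDetector() {}

    /** Returns true if the message contains an opening fence followed by a closing fence. */
    static boolean containsCodeFence(String message) {
        int open = message.indexOf(FENCE);
        if (open < 0) {
            return false;
        }
        // The closing fence must start after the opening fence ends.
        return message.indexOf(FENCE, open + FENCE.length()) >= 0;
    }
}
```

Both calls are plain linear scans over the message, with no pattern matching involved.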

Formatter rework

No file was left untouched. The core logic and approach are still the same, though. This is a major refactoring, with added documentation, unit tests, extra features and bug fixes.

The basic approach is:

  • tokenize the code into its tokens (Lexer)
  • put them back together, nicely formatted (CodeSectionFormatter)

The flow is available to the user via the class Formatter.

Lexer

The core of the lexer is the enum TokenType, which lists all recognized tokens.

Tokens are found by repeatedly matching the next token at the start of the remaining code. For example, int x = 5; results in the list:

INT
WHITESPACE
IDENTIFIER
WHITESPACE
ASSIGN
WHITESPACE
NUMBER
SEMICOLON
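The matching loop described above could be sketched roughly as follows. MiniLexer and its reduced TokenType enum are hypothetical and cover just enough token types for the example; the real TokenType lists far more.

```java
import java.util.ArrayList;
import java.util.List;

final class MiniLexer {
    // Reduced, hypothetical subset of token types for this example.
    enum TokenType { INT, WHITESPACE, IDENTIFIER, ASSIGN, NUMBER, SEMICOLON, UNKNOWN }

    private MiniLexer() {}

    /** Repeatedly matches the next token at the current position until the code is consumed. */
    static List<TokenType> tokenize(String code) {
        List<TokenType> tokens = new ArrayList<>();
        int i = 0;
        while (i < code.length()) {
            char c = code.charAt(i);
            int start = i;
            if (Character.isWhitespace(c)) {
                while (i < code.length() && Character.isWhitespace(code.charAt(i))) {
                    i++;
                }
                tokens.add(TokenType.WHITESPACE);
            } else if (Character.isDigit(c)) {
                while (i < code.length() && Character.isDigit(code.charAt(i))) {
                    i++;
                }
                tokens.add(TokenType.NUMBER);
            } else if (Character.isJavaIdentifierStart(c)) {
                while (i < code.length() && Character.isJavaIdentifierPart(code.charAt(i))) {
                    i++;
                }
                // Keywords look like identifiers, so they are told apart after reading the word.
                String word = code.substring(start, i);
                tokens.add(word.equals("int") ? TokenType.INT : TokenType.IDENTIFIER);
            } else {
                i++;
                if (c == '=') {
                    tokens.add(TokenType.ASSIGN);
                } else if (c == ';') {
                    tokens.add(TokenType.SEMICOLON);
                } else {
                    tokens.add(TokenType.UNKNOWN);
                }
            }
        }
        return tokens;
    }
}
```

Calling MiniLexer.tokenize("int x = 5;") yields the eight token types listed above, in that order.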

The previous proof-of-concept version of the lexer had significant performance problems, which is why matching is now made fast by:

  • using a rolling-window/string-view instead of doing real substrings (CharBuffer)
  • avoiding regex for most of the token types
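To illustrate the rolling-window idea, here is a small sketch, assuming the CharBuffer mentioned above is java.nio.CharBuffer. The helper WindowMatcher.consume is hypothetical; it checks for a fixed symbol at the start of the window and, on a match, only moves the buffer position instead of creating a substring.

```java
import java.nio.CharBuffer;

final class WindowMatcher {
    private WindowMatcher() {}

    /**
     * If the window starts with the given symbol, advances past it and returns true.
     * No substring is created; only the buffer position moves.
     */
    static boolean consume(CharBuffer window, String symbol) {
        if (window.remaining() < symbol.length()) {
            return false;
        }
        for (int i = 0; i < symbol.length(); i++) {
            // charAt is relative to the current position of the buffer.
            if (window.charAt(i) != symbol.charAt(i)) {
                return false;
            }
        }
        window.position(window.position() + symbol.length());
        return true;
    }
}
```

A caller would wrap the code once with CharBuffer.wrap(code) and then keep consuming from the same buffer until nothing remains.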

Unit tests now ensure, in an elegant way, that no type hides another as a prefix (for example, : is a prefix of ::, so the latter has to be matched first).
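The prefix property can be checked mechanically. The following is only a sketch of the idea and not the actual test from the PR: the hypothetical helper findHiddenSymbol takes the symbols in matching order and reports the first symbol that could never be matched because an earlier one is a prefix of it.

```java
import java.util.List;
import java.util.Optional;

final class PrefixOrderCheck {
    private PrefixOrderCheck() {}

    /**
     * Returns the first symbol that is hidden by an earlier symbol in the matching order,
     * or an empty Optional if the order is safe.
     */
    static Optional<String> findHiddenSymbol(List<String> symbolsInMatchOrder) {
        for (int i = 0; i < symbolsInMatchOrder.size(); i++) {
            String earlier = symbolsInMatchOrder.get(i);
            for (int j = i + 1; j < symbolsInMatchOrder.size(); j++) {
                String later = symbolsInMatchOrder.get(j);
                // If the earlier symbol is a prefix of the later one, it always wins the match.
                if (later.startsWith(earlier)) {
                    return Optional.of(later);
                }
            }
        }
        return Optional.empty();
    }
}
```

For the order "::", ":" the result is empty, while ":", "::" reports "::" as hidden.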

Formatting

The actual formatting happens in CodeSectionFormatter. For most of the actual logic, it refers to FormatterRules.

Essentially, it iterates through all tokens and builds the output string from them. For each token, it has to decide things like:

  • put a space before it?
  • put a space after it?
  • put a newline after it?
  • put indent before it?

It does so by maintaining some state, such as:

  • currentIndentLevel
  • currentGenericLevel
  • isStartOfLine
  • expectedSemicolonsInLine
  • isInPackageDeclaration
  • isInImportDeclaration

It is important that this state is kept rather slim; otherwise the formatter would be too fragile for non-compiling or incorrect code.
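As a rough illustration of this approach, the sketch below formats a list of non-whitespace tokens using only two pieces of state, currentIndentLevel and isStartOfLine. MiniFormatter and its three hard-coded rules are assumptions made for the example; the real rules live in FormatterRules and handle far more cases.

```java
import java.util.List;

final class MiniFormatter {
    private final StringBuilder result = new StringBuilder();
    private int currentIndentLevel;
    private boolean isStartOfLine = true;

    private MiniFormatter() {}

    /** Rebuilds a string from the given non-whitespace tokens. */
    static String format(List<String> tokens) {
        MiniFormatter formatter = new MiniFormatter();
        for (String token : tokens) {
            formatter.append(token);
        }
        return formatter.result.toString();
    }

    private void append(String token) {
        if (token.equals("}")) {
            // Clamp at zero so unbalanced braces in broken code never yield a negative indent.
            currentIndentLevel = Math.max(0, currentIndentLevel - 1);
        }

        if (isStartOfLine) {
            result.append("    ".repeat(currentIndentLevel));
        } else if (!token.equals(";")) {
            // Space before every token except a semicolon.
            result.append(' ');
        }
        result.append(token);
        isStartOfLine = false;

        if (token.equals("{")) {
            currentIndentLevel++;
            newline();
        } else if (token.equals(";") || token.equals("}")) {
            newline();
        }
    }

    private void newline() {
        result.append('\n');
        isStartOfLine = true;
    }
}
```

Because the state is this small, a stray closing brace only affects the indent of the lines around it instead of derailing the rest of the output.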

The actual formatter rules can be a bit nasty, since there are so many edge cases. Keep in mind, though, that the goal is not to create a 100% correct formatter. The formatter has to support incorrect code and must always yield at least okay-looking results.

There are lots of unit tests that cover many cases and real code examples.

[image: before]

[image: after]

Checklist

  • PoC
  • button UX instead of message command
  • auto-update on edit/delete
  • multi-action support
  • polish
  • add language support for buttons (can be done later, once needed)
  • rework Formatter-project
  • unit test (extractCode, formatter stuff)

@Zabuzard Zabuzard added the "new command" and "priority: major" labels Oct 9, 2022
@Zabuzard Zabuzard self-assigned this Oct 9, 2022
@Zabuzard Zabuzard force-pushed the feature/add_format_command branch 3 times, most recently from 6f2019a to 1065d15 Compare October 20, 2022 10:05
@Zabuzard
Member Author

Sonar detected code duplication with ScamBlocker in how UserInteractor is implemented without an adapter. There is nothing we can really do about that; it's intentional.

@Zabuzard Zabuzard force-pushed the feature/add_format_command branch 2 times, most recently from 8052a5e to 968bcc1 Compare October 27, 2022 07:36
@Zabuzard Zabuzard marked this pull request as ready for review October 28, 2022 12:08
@Zabuzard Zabuzard requested review from a team as code owners October 28, 2022 12:08
@Zabuzard
Member Author

SonarCloud Quality Gate failed.

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
1 Code Smell (rating A)

0.0% Coverage
1.3% Duplication

Sonar detected code duplication with ScamBlocker in how UserInteractor is implemented without an adapter. There is nothing we can really do about that; it's intentional.

@Tais993 Tais993 self-requested a review October 28, 2022 12:42
@Zabuzard
Member Author

Zabuzard commented Nov 2, 2022

@Tais993 reminder :)

@Zabuzard Zabuzard force-pushed the feature/add_format_command branch from b491bb9 to 7889183 Compare November 3, 2022 07:14
* The feature is secondary though, which is why its kept in RAM and not in the DB.
*/
private final Cache<Long, Long> originalMessageToCodeReply =
Caffeine.newBuilder().maximumSize(10_000).build();
Member

I notice a general trend of massive caches, but is this needed? I feel like this is an easy way for trolls to "crash" our bot; at least I'd assume 10k entries take up a lot of RAM.

Member Author

@Zabuzard Zabuzard Nov 3, 2022

@Tais993 10k entries should take almost no RAM at all; I'd estimate around 100 KB for a full cache. You'd need about 10,000 such caches to reach 1 GB of RAM.

The point of the cache is exactly that: to protect against RAM blowups. With a traditional Map, you have no proper limit and users are able to blow it up.

If you feel better, we can reduce it to maybe 2k? It's not like it really matters for this feature anyway.
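The difference between a size-bounded cache and a plain Map can be shown with the JDK alone. This is a sketch and not what the PR uses: the PR relies on Caffeine with maximumSize, while the hypothetical BoundedMap below swaps in a simple least-recently-used eviction built on LinkedHashMap to make the same point.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a bounded cache; evicts the least recently used entry.
final class BoundedMap<K, V> extends LinkedHashMap<K, V> {
    private final int maximumSize;

    BoundedMap(int maximumSize) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        super(16, 0.75f, true);
        this.maximumSize = maximumSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true drops the eldest entry.
        return size() > maximumSize;
    }
}
```

No matter how many messages are posted, such a map never holds more than maximumSize entries, which is the property the cache is used for here.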

Member Author

done ✔️

reduced to 2k

Comment on lines 133 to 136
@Nullable CodeAction disabledAction) {
return labelToCodeAction.values().stream().map(action -> {
Button button = createButtonForAction(action, originalMessageId);
return action == disabledAction ? button.asDisabled() : button;
Member

So you can only disable 1 action?

Member Author

The disabled action is the action that is currently active, yes.

For example, if you have 3 buttons (format, run, bytecode) and you click on run, then you want run to be disabled, because it is currently active.

Member Author

I'm going to rename the param to currentlyActiveAction.

Member Author

done ✔️

}

private static Stream<Arguments> provideExtractCodeTests() {
return Stream.of(createExtractCodeArgumentsFor("basic", """
Member

In all honesty, this is unreadable. Using a list with separate List#add calls would be a lot more readable.

Member Author

I doubt it would be more readable after Spotless, but sure, I can give it a try.

Member Author

done ✔️

I think it's actually more readable now, thanks.

* Indexes tokens to contain information about whether they are code tokens or not
* Formats the given string.
* <p>
* Best results are achieved for Java code.
Member

"best results", that sounds like this works for JS, but not as good.

Instead just say "Only works with Java"

Member Author

"best results", that sounds like this works for JS, but not as good.

But it does, and it's expected to be used for that as well.

Right now, the format action is available for all languages and yields okay-ish results for everything I tested (except languages without semicolons).

import java.util.stream.Stream;

/**
* Queue that holds tokens to be consumed.
Member

The term "token" is very often used, but I haven't noticed much of a description.

I'd heavily appreciate it if you link to the Token class

Member Author

That's fair. I guess it was clear due to the context of lexing (= tokenization).

Member Author

@Zabuzard Zabuzard Nov 3, 2022

done ✔️

Added an extra paragraph and some links.

DOT("."),
SEMICOLON(";"),
METHOD_REFERENCE("::"),
COLON(":", false, true), // technically not a "real" operator but used in an enhanced for loop
Member

Not needed anymore?

Member Author

It is needed, so that the lexer recognizes it as an individual token.

It's still part of the list somewhere; I reordered a lot of stuff.

Member Author

found it :)

[image: colon]

@Zabuzard Zabuzard requested a review from Tais993 November 5, 2022 08:45
@Zabuzard Zabuzard force-pushed the feature/add_format_command branch from 888e183 to 9fd8e1a Compare November 5, 2022 08:50
@sonarqubecloud

sonarqubecloud bot commented Nov 5, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

0.0% Coverage
1.2% Duplication

@Zabuzard
Member Author

Zabuzard commented Nov 8, 2022

Going to merge this now. It's quite old already and I do not want to hold back the feature for much longer just because people don't have time to review. There don't seem to be any red flags, so you can continue with the CR and I'll do proposed changes in a follow-up PR instead.

@Zabuzard Zabuzard merged commit cefe923 into develop Nov 8, 2022
@Zabuzard Zabuzard deleted the feature/add_format_command branch November 8, 2022 07:50
@Zabuzard Zabuzard mentioned this pull request Nov 8, 2022

Labels

new command, priority: major

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create FormatCommand

2 participants