Skip to content

Conversation

@biandratti
Copy link
Contributor

@biandratti biandratti commented Oct 23, 2020

Description

We were using a volatile variable called "isClosed". When the node is shutdown, close the database, and the variable has the value "true". So usually when it receives a new request, this request validates the value of "isClosed" and throws the exception "This RocksDbDataSource has been closed".
I reproduced this error message when I used regular sync with network "mordor". And RocksDbDataSource receives at the same time a “close” request and “get” (or “update”) request for example.

Today, before validating the ReentrantReadWriteLock variable, we validated a volatile variable called "isClosed".
So, we could have a race condition problem. For example when calling the function "update" and "close" at the same time, both validate the variable "isClosed". But at this moment, the variable dbLock could be taken firstly for the function "close". This function closes the database and then unlocks the dbLock variable. Then the update function resumes but the database is closed.

Interrogant

We decided to change the behavior of variable ReentrantReadWriteLock “dbLock” from the “update” function. We began to use writeLock in place of readLock. Eventually if we detect a loss of performance for these changes, we recommend reverting this behavior and use readLock again.

Proposed Solution

  • I changed the order to avoid race conditions.
  • When the system applies shutdown, finishes the actor system and then closes the database. Usually when finalizing the actors some requests that these started remain pending without finishing. So we can see the exception “RocksDbDataSourceClosedException.
    I tried with different strategies in the coordinated-shutdown of akka, and got the same result. So I decided as a workaround to change the order of shutdown, and in the end close the database. In this way the exception in my tests disappeared.

In a future we could work in a strategy with supervisors and a strategy of coordinated-shutdown from actors system

@biandratti biandratti added the bug Something isn't working label Oct 23, 2020
@biandratti biandratti changed the title [ETCM-185] Closing twice rocks db [ETCM-185] Closing db Oct 25, 2020

override def getBestBlock(): Block =
getBlockByNumber(getBestBlockNumber()).get
getBlockByNumber(getBestBlockNumber()).getOrElse(throw new RuntimeException("block not found"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change reverted

case NonFatal(e) =>
logger.error(s"Not found associated value to a namespace: $namespace and a key: $key, cause: {}", e.getMessage)
throw new RuntimeException(e)
throw e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@mirkoAlic
Copy link
Contributor

mirkoAlic commented Oct 28, 2020

Interrogant
update function, need dbLock.readLock() or dbLock.writeLock() ?

IMHO, dbLock.writeLock is the "proper" way, but tbh, I'm not sure if is not going to raise some kind of bottleneck. I suggest to apply the change, but also add a good comment about the reasoning behind it. Eventually if we detect an issue we could rollback the change. WDYT?

*/
override def close(): Unit = {
logger.debug(s"About to close DataSource in path: ${rocksDbConfig.path}")
logger.info(s"About to close DataSource in path: ${rocksDbConfig.path}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

logger.info(s"DataSource closed successfully in the path: ${rocksDbConfig.path}")
} catch {
case NonFatal(e) =>
logger.error("Not closed the DataSource properly, cause: {}", e)
Copy link
Contributor

@mirkoAlic mirkoAlic Oct 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least as a temporal solution i think is preferable to have some kind of domain modelling for DataSource runtime exceptions, instead of just log and throw an exception; In that case, we could just throw our custom exceptions and their will provide better context about the actual issues (+ will be avoid the need of logged them)

private val logger = LoggerFactory.getLogger("rocks-db")

@volatile
private var isClosed = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about changing it to https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/atomic/AtomicReference.html
?

I have good experience with atomic refs. I remember using @volatile a very long time ago in Java and it was no reliable 🤔

Copy link
Contributor

@kapke kapke Oct 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/atomic/AtomicBoolean.html :)

TBH I think that volatile is enough here since updates of this var don't involve a read (and that's the basic problem classes in java.util.concurrent.atomic solve - making reads and writes that depend on each other atomically)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we are now writing to this variable only after write lock is in locked state, this could probably be even a simple var without @volatile annotations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we need this variable as @volatile.
In case only one thread reads and writes the value of a volatile variable and other threads only read the variable, then the reading threads are guaranteed to see the latest value written to the volatile variable.
I agree with @kapke, that volatile is enough here since updates of this var don't involve a read. Are we ok with this? Or do you think in another approach?

Copy link
Contributor

@KonradStaniec KonradStaniec Oct 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we only modify this variable from the inside of write lock and read only from inside read lock, which has strictly stronger guarantees than volatile variable. So the variable could be simple var and we would still have more synchronization than with volatile variable.

But to be honest we can leave it volatile, it is not big deal here. (I am against atomics here as having cas operations inside locks, would really be an overkill)


class RocksDbDataSourceClosedException(val message: String) extends IllegalStateException(message)

private val logger = LoggerFactory.getLogger("rocks-db")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not extends Logger then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

private val logger = LoggerFactory.getLogger("rocks-db")

@volatile
private var isClosed = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we are now writing to this variable only after write lock is in locked state, this could probably be even a simple var without @volatile annotations

override def close(): Unit = {
logger.debug(s"About to close DataSource in path: ${rocksDbConfig.path}")
log.info(s"About to close DataSource in path: ${rocksDbConfig.path}")
assureNotClosed()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should not check it after locking ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this case, it is the same. I changed it anyway, and put the same behavior that the other functions.

Copy link
Contributor

@mirkoAlic mirkoAlic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@biandratti biandratti merged commit e279a5d into develop Nov 2, 2020
@mirkoAlic mirkoAlic deleted the ETCM-185-db-closed branch November 2, 2020 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants