Description
Describe the bug
The OTA API and the task that is expected to be used with it share common data values without any synchronization between tasks/threads. The OTA implementation is NOT thread/task safe.
There is a gross error in the way portions of the otaAgent internal state are read/modified/written. Accesses to portions of it are assumed to be atomic across all tasks/threads, but there is no guarantee that this is the case.
There are 3 potential tasks/threads where actions can be performed and that are currently in contention, plus a timer task that is okay:
- Application task - executing the OTA_* API, e.g. OTA_Shutdown(), OTA_GetState(), OTA_SignalEvent(), OTA_ActivateNewImage(), etc.
- OTA_EventProcessingTask()
- Network task (mqtt or http) - executing the callbacks
- Timer task - for timers. This one is okay because all of the timers used send events via a queue to the OTA_EventProcessingTask().
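The reason the timer path is safe can be sketched as follows: producers only post an event into a locked queue, and a single processing task is the only code that ever touches the state. This is a minimal sketch using POSIX mutexes as a stand-in for the RTOS queue; every name in it (EvtQueue_t, evtPost, evtProcessPending, and so on) is hypothetical and not part of the library.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define EVT_QUEUE_LEN 8U

/* Hypothetical event ids; not the library's event enum. */
typedef enum { EvtTimerExpired, EvtSuspend, EvtResume } EvtId_t;

typedef struct
{
    pthread_mutex_t lock;            /* Guards items, head and count. */
    EvtId_t items[ EVT_QUEUE_LEN ];
    size_t head;
    size_t count;
} EvtQueue_t;

static EvtQueue_t evtQueue = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Written only by the processing task, so it needs no lock of its own. */
static bool evtSuspended = false;

/* Called from any task (timer, application, network): only enqueues. */
static bool evtPost( EvtId_t id )
{
    bool queued = false;

    pthread_mutex_lock( &evtQueue.lock );
    if( evtQueue.count < EVT_QUEUE_LEN )
    {
        evtQueue.items[ ( evtQueue.head + evtQueue.count ) % EVT_QUEUE_LEN ] = id;
        evtQueue.count++;
        queued = true;
    }
    pthread_mutex_unlock( &evtQueue.lock );

    return queued;
}

/* Non-blocking pop used by the processing task. */
static bool evtTake( EvtId_t * pId )
{
    bool taken = false;

    pthread_mutex_lock( &evtQueue.lock );
    if( evtQueue.count > 0U )
    {
        *pId = evtQueue.items[ evtQueue.head ];
        evtQueue.head = ( evtQueue.head + 1U ) % EVT_QUEUE_LEN;
        evtQueue.count--;
        taken = true;
    }
    pthread_mutex_unlock( &evtQueue.lock );

    return taken;
}

/* Body of the single processing task: drains the queue and mutates state. */
static size_t evtProcessPending( void )
{
    size_t handled = 0U;
    EvtId_t id;

    while( evtTake( &id ) )
    {
        switch( id )
        {
            case EvtSuspend: evtSuspended = true;  break;
            case EvtResume:  evtSuspended = false; break;
            default:         break; /* Timer expiry: nothing to change here. */
        }
        handled++;
    }

    return handled;
}
```

The same shape would apply to any API call whose work can be deferred: the caller posts an event and returns, and the state change happens on the processing task.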
For the state and/or callbacks there is no synchronization barrier (e.g. a semaphore or mutex) around the otaAgent information when any of these three tasks access the otaAgent common control block.
These values MUST be either declared atomic OR accessed only while holding a semaphore/mutex lock, so that actions performed on them by a task calling the OTA_*() API functions or by the task running OTA_EventProcessingTask() cannot inadvertently overwrite each other's values - especially in code portions that follow a read - decision - write pattern.
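To make the read - decision - write hazard concrete, here is a minimal sketch of a guarded state transition in which all three steps happen inside one critical section. It uses a POSIX mutex purely for illustration; DemoAgent_t, DemoState_t and demoTransition() are hypothetical names, not the library's types or functions.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the agent states; not the library's enum. */
typedef enum { DemoStateStopped, DemoStateReady, DemoStateSuspended } DemoState_t;

typedef struct
{
    pthread_mutex_t lock; /* Guards every field below. */
    DemoState_t state;
} DemoAgent_t;

static DemoAgent_t demoAgent =
{
    .lock  = PTHREAD_MUTEX_INITIALIZER,
    .state = DemoStateStopped
};

/* Move from 'expected' to 'next' only if the state is still 'expected'.
 * Without the lock, another task could change the state between the
 * comparison and the assignment, and its update would be overwritten. */
static bool demoTransition( DemoState_t expected, DemoState_t next )
{
    bool changed = false;
    int rc = pthread_mutex_lock( &demoAgent.lock );

    assert( rc == 0 );

    if( demoAgent.state == expected ) /* read + decision */
    {
        demoAgent.state = next;       /* write */
        changed = true;
    }

    rc = pthread_mutex_unlock( &demoAgent.lock );
    assert( rc == 0 );
    ( void ) rc;

    return changed;
}
```

A caller that loses the race gets false back and can decide what to do, instead of silently clobbering the other task's update.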
I'm only providing examples pertaining to the API (App -> OTA_EventProcessingTask()), but there are most likely others between the network registered callbacks and OTA_EventProcessingTask() as well.
E.g. OTA_Init()
This should have something along the lines of:
// Assumes OTA_Init() itself is not called concurrently; otherwise the mutex
// must be created before any task that uses the API is started.
if (otaAgent.lock == NULL)
{
    otaAgent.lock = xSemaphoreCreateMutex();
    assert(otaAgent.lock != NULL);
}
BaseType_t semRet = xSemaphoreTake(otaAgent.lock, portMAX_DELAY);
assert(pdTRUE == semRet);
(void)semRet;
// All reads and/or modifications of otaAgent and its associated values.
// Lines - https://github.com/aws/ota-for-aws-iot-embedded-sdk/blob/c3bd5840979cadfe1f9505e13e49cccb87333650/source/ota.c#L3264-L3347
semRet = xSemaphoreGive(otaAgent.lock);
assert(pdTRUE == semRet);
Other APIs that require this type of change are:
- OTA_Shutdown - requires local copy of state and then return outside of semaphore/mutex lock.
- OTA_GetState - requires local copy of state and then return outside of semaphore/mutex lock.
- OTA_GetStatistics - otherwise portions of the stats may not be consistent with each other. I'd suggest a separate lock for this.
- OTA_ActivateNewImage - requires creating a local copy of the function pointer, ?? activateFn = otaAgent.pOtaInterface->pal.activate, and then using that if not null.
- OTA_SetImageState - required when setImageStateWithReason() is used.
- OTA_GetImageState - requires creating a local copy of the imageState within a lock.
- OTA_Suspend - should move that code into the action performed when the OtaAgentEventSuspend message is received by the OTA_EventProcessingTask
- OTA_Resume - stopped here - you get the idea...
- OTA_SignalEvent - for the statistics and read of state - the stats should probably have their own lock
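The "local copy" items in the list above all follow the same shape: take the lock, copy the shared value into a local, release the lock, then use or return the local. The sketch below shows that shape for a state getter, a callback pointer, and statistics guarded by their own lock. It uses POSIX mutexes for illustration, and every name in it (SnapAgent_t, snapGetState, snapActivateNewImage, snapGetStats, and so on) is hypothetical rather than taken from the library.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names come from the library. */
typedef enum { SnapStateStopped, SnapStateReady } SnapState_t;
typedef int ( * SnapActivateFn_t )( void );
typedef struct { uint32_t received; uint32_t processed; } SnapStats_t;

typedef struct
{
    pthread_mutex_t lock;      /* Guards state and activate. */
    SnapState_t state;
    SnapActivateFn_t activate;
    pthread_mutex_t statsLock; /* Separate lock: stats readers never block state changes. */
    SnapStats_t stats;
} SnapAgent_t;

static SnapAgent_t snapAgent =
{
    .lock      = PTHREAD_MUTEX_INITIALIZER,
    .state     = SnapStateReady,
    .activate  = NULL,
    .statsLock = PTHREAD_MUTEX_INITIALIZER,
    .stats     = { 0U, 0U }
};

/* Copy the state under the lock, return the copy outside of it. */
static SnapState_t snapGetState( void )
{
    pthread_mutex_lock( &snapAgent.lock );
    SnapState_t snapshot = snapAgent.state;
    pthread_mutex_unlock( &snapAgent.lock );

    return snapshot;
}

static void snapSetActivate( SnapActivateFn_t fn )
{
    pthread_mutex_lock( &snapAgent.lock );
    snapAgent.activate = fn;
    pthread_mutex_unlock( &snapAgent.lock );
}

/* Copy the function pointer under the lock, then call the copy if non-NULL.
 * The callback runs outside the lock so it cannot deadlock against the API. */
static int snapActivateNewImage( void )
{
    pthread_mutex_lock( &snapAgent.lock );
    SnapActivateFn_t fn = snapAgent.activate;
    pthread_mutex_unlock( &snapAgent.lock );

    return ( fn != NULL ) ? fn() : -1;
}

/* Update both counters in one critical section so they stay consistent. */
static void snapCountPacket( void )
{
    pthread_mutex_lock( &snapAgent.statsLock );
    snapAgent.stats.received++;
    snapAgent.stats.processed++;
    pthread_mutex_unlock( &snapAgent.statsLock );
}

/* Return a whole-struct snapshot so the fields agree with each other. */
static SnapStats_t snapGetStats( void )
{
    pthread_mutex_lock( &snapAgent.statsLock );
    SnapStats_t snapshot = snapAgent.stats;
    pthread_mutex_unlock( &snapAgent.statsLock );

    return snapshot;
}

/* Trivial callback used only to exercise the sketch. */
static int snapActivateOk( void )
{
    return 0;
}
```

Calling the copied pointer outside the lock is a design choice: it keeps the critical section short, at the cost that the callback may run just after another task has replaced it.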
APIs that look to be okay:
- OTA_CheckForUpdate()
- OTA_Err_strerror()
- OTA_JobParse_strerror
- OTA_PalStatus_strerror
- OTA_OsStatus_strerror
As mentioned, I did not check any of the handlers that are registered with the network, but I assume they most likely have the same level of issue.
Host
- Host OS: Linux - but this is ANY OS including FreeRTOS
- Version: Ubuntu 18.04
To Reproduce
- N/A - done by inspection, but it could be reproduced by running this through Thread Sanitizer (clang) and looking at the reported errors.
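As a rough idea of what a Thread Sanitizer run would look at, here is a minimal two-thread sketch of the read - modify - write pattern, written with POSIX threads. It is not the OTA code; raceRun(), raceWorker() and the RACE_DEMO_NO_LOCK macro are hypothetical. Built as-is with clang -fsanitize=thread -g it should be clean; defining RACE_DEMO_NO_LOCK removes the lock, and the sanitizer would then be expected to report a data race on raceCounter.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define RACE_ITERATIONS 100000L

static pthread_mutex_t raceLock = PTHREAD_MUTEX_INITIALIZER;
static long raceCounter = 0; /* Shared between both worker threads. */

/* Locking is compiled out when the hypothetical RACE_DEMO_NO_LOCK is defined. */
static void raceEnter( void )
{
#ifndef RACE_DEMO_NO_LOCK
    pthread_mutex_lock( &raceLock );
#endif
}

static void raceExit( void )
{
#ifndef RACE_DEMO_NO_LOCK
    pthread_mutex_unlock( &raceLock );
#endif
}

static void * raceWorker( void * pArg )
{
    ( void ) pArg;

    for( long i = 0; i < RACE_ITERATIONS; i++ )
    {
        raceEnter();
        raceCounter = raceCounter + 1; /* read - modify - write */
        raceExit();
    }

    return NULL;
}

/* Run two workers to completion and return the final counter value. */
static long raceRun( void )
{
    pthread_t first;
    pthread_t second;
    int rc;

    raceCounter = 0;

    rc = pthread_create( &first, NULL, raceWorker, NULL );
    assert( rc == 0 );
    rc = pthread_create( &second, NULL, raceWorker, NULL );
    assert( rc == 0 );

    rc = pthread_join( first, NULL );
    assert( rc == 0 );
    rc = pthread_join( second, NULL );
    assert( rc == 0 );
    ( void ) rc;

    return raceCounter;
}
```

With the lock in place, calling raceRun() from a test harness returns exactly twice RACE_ITERATIONS; without it the result is undefined and usually lower.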
Expected behavior
See above. I expected that all API calls that use or modify the otaAgent.* internal construct, which is shared with other tasks, would protect access to those fields with a semaphore and/or mutex.
Screenshots
N/A
Wireshark logs
N/A
Additional context
N/A