[BUG] OTA API and OTA_EventProcessingTask is not task/thread safe when it comes to accessing common state. #465

@phelter


Describe the bug
The OTA API and the task expected to run alongside it (OTA_EventProcessingTask) share data without any synchronization between tasks/threads. The OTA implementation is NOT thread/task safe.

There is a gross error in the way portions of the otaAgent internal state are read, modified, and written. Accesses to parts of it are assumed to be atomic across all tasks/threads, but nothing guarantees that this is the case.

There are four potential tasks/threads where actions can be performed; the first three are currently in contention:

  • Application task - executing the OTA_* api - eg: OTA_Shutdown(), OTA_GetState(), OTA_SignalEvent(), OTA_ActivateNewImage(), etc.
  • OTA_EventProcessingTask()
  • Network task (mqtt or http) executing the callbacks
  • Timer task - for timers - This one is okay because all of the timers used are sending Events via a queue to the EventProcessingTask.
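
The reason the queue-based approach is safe can be illustrated with a minimal sketch. It uses POSIX threads rather than the OTA library or FreeRTOS queues, and every name in it (`EvqQueue`, `evq_send`, `evq_receive`, the event values) is a hypothetical stand-in, not part of the real code base. Producers only enqueue an event; the single consumer task is the only one that would act on shared state.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define EVQ_CAPACITY 8

/* Hypothetical event identifiers; not the real OTA event enum. */
typedef enum { EVQ_EVENT_TIMER_EXPIRED, EVQ_EVENT_SUSPEND } EvqEvent;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  notEmpty;
    EvqEvent        items[EVQ_CAPACITY];
    int             head;   /* index of the oldest event */
    int             count;  /* number of queued events */
} EvqQueue;

static EvqQueue evq = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, { 0 }, 0, 0
};

/* Called from any task (timer, application, network).
 * Non-blocking: returns false if the queue is full. */
bool evq_send(EvqEvent ev)
{
    bool ok = false;

    pthread_mutex_lock(&evq.lock);
    if (evq.count < EVQ_CAPACITY)
    {
        evq.items[(evq.head + evq.count) % EVQ_CAPACITY] = ev;
        evq.count++;
        ok = true;
        pthread_cond_signal(&evq.notEmpty);
    }
    pthread_mutex_unlock(&evq.lock);

    return ok;
}

/* Called only from the event-processing task.
 * Blocks until an event is available, then returns the oldest one. */
EvqEvent evq_receive(void)
{
    EvqEvent ev;

    pthread_mutex_lock(&evq.lock);
    while (evq.count == 0)
    {
        pthread_cond_wait(&evq.notEmpty, &evq.lock);
    }
    assert(evq.count > 0);
    ev = evq.items[evq.head];
    evq.head = (evq.head + 1) % EVQ_CAPACITY;
    evq.count--;
    pthread_mutex_unlock(&evq.lock);

    return ev;
}
```

With this shape, the only data shared between tasks is the queue itself, and every access to it happens under `evq.lock`.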

For the state and/or callbacks, there is no synchronization barrier (e.g. a semaphore or mutex) protecting the otaAgent common control block when any of these three tasks access it.

These values MUST be either specified as atomic OR consumed within a semaphore/mutex lock so that actions performed upon them by either a task calling the OTA_*() API functions or the task running OTA_EventProcessingTask() will not inadvertently overwrite the values, especially within code portions that follow a read - decision - write pattern.
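
To make the read - decision - write hazard concrete, here is a minimal sketch using POSIX threads (not the OTA library itself). `ChkAgent`, `chk_try_start`, and the two states are hypothetical stand-ins for the otaAgent control block; only the pattern is the point. The read, the decision, and the write all sit inside one critical section, so two callers cannot both observe the stopped state and both perform the initialization.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared control block; not the real otaAgent. */
typedef enum { CHK_STOPPED, CHK_READY } ChkState;

typedef struct {
    pthread_mutex_t lock;
    ChkState        state;
    int             initCount; /* how many callers actually initialized */
} ChkAgent;

static ChkAgent chkAgent = { PTHREAD_MUTEX_INITIALIZER, CHK_STOPPED, 0 };

/* Returns true only for the one caller that performs the transition. */
bool chk_try_start(void)
{
    bool started = false;

    pthread_mutex_lock(&chkAgent.lock);
    if (chkAgent.state == CHK_STOPPED)  /* read + decision */
    {
        chkAgent.state = CHK_READY;     /* write */
        chkAgent.initCount++;
        started = true;
    }
    pthread_mutex_unlock(&chkAgent.lock);

    return started;
}

static void *chk_worker(void *arg)
{
    (void)arg;
    (void)chk_try_start();
    return NULL;
}

/* Starts up to 16 concurrent callers and returns how many inits happened. */
int chk_run_contenders(int nThreads)
{
    pthread_t threads[16];
    int i;

    if (nThreads > 16)
    {
        nThreads = 16;
    }
    for (i = 0; i < nThreads; i++)
    {
        int rc = pthread_create(&threads[i], NULL, chk_worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (i = 0; i < nThreads; i++)
    {
        pthread_join(threads[i], NULL);
    }
    /* All workers have been joined, so this read is no longer contended. */
    return chkAgent.initCount;
}
```

Without the lock around the `if`, two contenders could both see `CHK_STOPPED` and both run the initialization path, which is the kind of overwrite described above.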

I'm only providing the examples pertaining to the API (App -> OTA_EventProcessingTask()) but there are most likely others between the Network registered callbacks and the OTA_EventProcessingTask() as well.

Eg: the OTA_Init()

This should have something along the lines of:

    if (otaAgent.lock == NULL)
    {
        otaAgent.lock = xSemaphoreCreateMutex();
        assert(otaAgent.lock != NULL);
    }
    BaseType_t semRet = xSemaphoreTake(otaAgent.lock, portMAX_DELAY);
    assert(pdTRUE == semRet);
    (void)semRet;
    // All reads and/or modifications of otaAgent and its associated values.
    //  Lines - https://github.com/aws/ota-for-aws-iot-embedded-sdk/blob/c3bd5840979cadfe1f9505e13e49cccb87333650/source/ota.c#L3264-L3347

    semRet = xSemaphoreGive(otaAgent.lock);
    assert(pdTRUE == semRet);

Other APIs that require this type of change are:

  • OTA_Shutdown - requires local copy of state and then return outside of semaphore/mutex lock.
  • OTA_GetState - requires local copy of state and then return outside of semaphore/mutex lock.
  • OTA_GetStatistics - otherwise portions of the stats may not be correct relative to each other. - might suggest a separate lock for this.
  • OTA_ActivateNewImage - requires creating a local copy of the ?? activateFn = otaAgent.pOtaInterface->pal.activate and then using that if not null.
  • OTA_SetImageState - required when setImageStateWithReason() is used.
  • OTA_GetImageState - requires creating a local copy of the imageState within a lock.
  • OTA_Suspend - should move that code into the action performed by the OtaAgentEventSuspend message being received by the OTA_EventProcessingTask
  • OTA_Resume - stopped here - you get the idea...
  • OTA_SignalEvent - for the statistics and read of state - the stats should probably have their own lock
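
The "local copy, then return outside the lock" pattern suggested for several of these APIs could look roughly like the sketch below. It uses POSIX threads and entirely hypothetical names (`SnapAgent`, `snap_get_state`, `snap_get_stats`, `snap_activate`, and so on); it is not the real otaAgent layout or the real OTA function signatures. A single lock is used here for brevity, although a separate lock for the statistics, as suggested above, would work the same way.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types; the real otaAgent fields differ. */
typedef struct { uint32_t received; uint32_t processed; } SnapStats;
typedef int (*SnapActivateFn)(void);

typedef struct {
    pthread_mutex_t lock;
    int             state;
    SnapStats       stats;
    SnapActivateFn  activate;
} SnapAgent;

static SnapAgent snapAgent = { PTHREAD_MUTEX_INITIALIZER, 0, { 0, 0 }, NULL };

void snap_set_state(int newState)
{
    pthread_mutex_lock(&snapAgent.lock);
    snapAgent.state = newState;
    pthread_mutex_unlock(&snapAgent.lock);
}

/* Copy the state under the lock, return the copy after unlocking. */
int snap_get_state(void)
{
    int copy;

    pthread_mutex_lock(&snapAgent.lock);
    copy = snapAgent.state;
    pthread_mutex_unlock(&snapAgent.lock);

    return copy;
}

/* Update both counters in one critical section. */
void snap_record_packet(void)
{
    pthread_mutex_lock(&snapAgent.lock);
    snapAgent.stats.received++;
    snapAgent.stats.processed++;
    pthread_mutex_unlock(&snapAgent.lock);
}

/* Copy the whole struct at once so the fields are mutually consistent. */
void snap_get_stats(SnapStats *out)
{
    assert(out != NULL);
    pthread_mutex_lock(&snapAgent.lock);
    *out = snapAgent.stats;
    pthread_mutex_unlock(&snapAgent.lock);
}

void snap_set_activate(SnapActivateFn fn)
{
    pthread_mutex_lock(&snapAgent.lock);
    snapAgent.activate = fn;
    pthread_mutex_unlock(&snapAgent.lock);
}

/* Copy the function pointer under the lock, then call it outside the
 * lock so the callback cannot deadlock by re-entering the API.
 * Returns -1 if no callback is registered. */
int snap_activate(void)
{
    SnapActivateFn fn;

    pthread_mutex_lock(&snapAgent.lock);
    fn = snapAgent.activate;
    pthread_mutex_unlock(&snapAgent.lock);

    return (fn != NULL) ? fn() : -1;
}

/* Sample callback used only for illustration. */
int snap_sample_activate(void)
{
    return 0;
}
```

The lock is held only for the duration of a copy, so callers never hold it while running a callback or returning to application code.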

APIs that look to be okay:

  • OTA_CheckForUpdate()
  • OTA_Err_strerror()
  • OTA_JobParse_strerror
  • OTA_PalStatus_strerror
  • OTA_OsStatus_strerror

As mentioned, I did not check any of the handlers that are registered with the network stack, but I assume they most likely have the same kind of issue.

Host

  • Host OS: Linux - but this is ANY OS including FreeRTOS
  • Version: Ubuntu 18.04

To Reproduce

  • N/A - done by inspection, but it could be reproduced by running this through Thread Sanitizer (clang) and observing the reported errors.
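
As a rough illustration of what Thread Sanitizer would flag, the sketch below has two threads performing the same read-modify-write on a shared counter, once with a mutex and once without. It is a standalone POSIX threads example with hypothetical names (`race_run`, `raceCounter`), not a reproduction against the OTA sources.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define RACE_THREADS    2
#define RACE_ITERATIONS 100000L

static pthread_mutex_t raceLock = PTHREAD_MUTEX_INITIALIZER;
static long raceCounter;
static bool raceUseLock;

static void *race_worker(void *arg)
{
    long i;

    (void)arg;
    for (i = 0; i < RACE_ITERATIONS; i++)
    {
        if (raceUseLock)
        {
            pthread_mutex_lock(&raceLock);
            raceCounter++;
            pthread_mutex_unlock(&raceLock);
        }
        else
        {
            /* Unsynchronized read-modify-write: this is a data race. */
            raceCounter++;
        }
    }
    return NULL;
}

/* Runs the workers and returns the final counter value.
 * With useLock == true the result is exact; with false it is not
 * guaranteed, because increments from the two threads can be lost. */
long race_run(bool useLock)
{
    pthread_t threads[RACE_THREADS];
    int i;

    raceCounter = 0;
    raceUseLock = useLock; /* set before any worker thread exists */

    for (i = 0; i < RACE_THREADS; i++)
    {
        int rc = pthread_create(&threads[i], NULL, race_worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (i = 0; i < RACE_THREADS; i++)
    {
        pthread_join(threads[i], NULL);
    }
    return raceCounter;
}
```

Building with `clang -fsanitize=thread -g` and calling `race_run(false)` from a `main` should make Thread Sanitizer report a data race on `raceCounter`, while `race_run(true)` should run clean. The same build flag applied to a host build of the OTA library and its tests would be the way to surface the issues described here.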

Expected behavior

See above. For every API call that uses or modifies the otaAgent.* internal structure, which is also used by other tasks, access to those fields is expected to be protected by a semaphore and/or mutex.

Screenshots

N/A

Wireshark logs

N/A

Additional context

N/A
