[BUG] OTA API and OTA_EventProcessingTask are not task/thread safe when it comes to accessing common state.

**Describe the bug**
The OTA API and the task that is expected to run alongside it (`OTA_EventProcessingTask`) share data values without any synchronization between tasks/threads. The OTA implementation is **NOT** thread/task safe.

There is a gross error in the way portions of the `otaAgent` internal state are read, modified, and written. Accesses to parts of it are assumed to be atomic across all tasks/threads, but nothing guarantees that this is the case.

There are four tasks/threads from which actions can be performed. The first three are currently in contention:
* Application task - executing the OTA_* API, e.g. `OTA_Shutdown()`, `OTA_GetState()`, `OTA_SignalEvent()`, `OTA_ActivateNewImage()`, etc.
* [OTA_EventProcessingTask()](https://github.com/aws/ota-for-aws-iot-embedded-sdk/blob/c3bd5840979cadfe1f9505e13e49cccb87333650/source/include/ota.h#L717)
* Network task (`mqtt` or `http`) executing the callbacks
* Timer task - this one is okay because all of the timers used send events via a queue to `OTA_EventProcessingTask()`.


There is no synchronization barrier (e.g. a semaphore or mutex) protecting the state or the callbacks when any of these three tasks accesses the `otaAgent` common control block.

These values **MUST** either be declared atomic or be accessed only while holding a semaphore/mutex lock, so that a task calling the OTA_*() API functions and the task running `OTA_EventProcessingTask()` cannot inadvertently overwrite each other's values. This matters most in code portions that follow a `read - decision - write` pattern.
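A minimal sketch of a `read - decision - write` sequence made atomic with a lock. It uses a POSIX mutex instead of a FreeRTOS semaphore so that it builds on a Linux host, and `DemoAgent_t`, `DemoState_t`, and `demoTransition()` are hypothetical stand-ins, not the library's actual `otaAgent` types or functions.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the agent state; NOT the library's real type. */
typedef enum
{
    DemoStateStopped,
    DemoStateReady,
    DemoStateSuspended
} DemoState_t;

/* Hypothetical control block: the shared state plus the lock that guards it. */
typedef struct
{
    pthread_mutex_t lock;
    DemoState_t state;
} DemoAgent_t;

static DemoAgent_t demoAgent = { PTHREAD_MUTEX_INITIALIZER, DemoStateStopped };

/* Performs read - decision - write as one critical section.
 * Returns true only if the state was 'expected' and has been set to 'next'. */
static bool demoTransition( DemoState_t expected, DemoState_t next )
{
    bool changed = false;
    int rc = pthread_mutex_lock( &demoAgent.lock );
    assert( rc == 0 );

    if( demoAgent.state == expected ) /* read + decision */
    {
        demoAgent.state = next;       /* write, still under the lock */
        changed = true;
    }

    rc = pthread_mutex_unlock( &demoAgent.lock );
    assert( rc == 0 );
    ( void ) rc; /* avoid an unused-variable warning when asserts are compiled out */

    return changed;
}
```

Because the comparison and the assignment happen under the same lock, a second task cannot change `demoAgent.state` between the check and the write.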

I'm only providing the examples pertaining to the API (App -> OTA_EventProcessingTask()) but there are most likely others between the Network registered callbacks and the OTA_EventProcessingTask() as well.

For example, take [OTA_Init()](https://github.com/aws/ota-for-aws-iot-embedded-sdk/blob/c3bd5840979cadfe1f9505e13e49cccb87333650/source/ota.c#L3264).

This should have something along the lines of:
```c
    /* Create the lock once. Note that this check-then-create is itself only
     * safe if the first OTA_Init() call cannot race with another task. */
    if (otaAgent.lock == NULL)
    {
        otaAgent.lock = xSemaphoreCreateMutex();
        assert(otaAgent.lock != NULL);
    }

    BaseType_t semRet = xSemaphoreTake(otaAgent.lock, portMAX_DELAY);
    assert(pdTRUE == semRet);

    // All reads and/or modifications of otaAgent and its associated values.
    //  Lines - https://github.com/aws/ota-for-aws-iot-embedded-sdk/blob/c3bd5840979cadfe1f9505e13e49cccb87333650/source/ota.c#L3264-L3347

    semRet = xSemaphoreGive(otaAgent.lock);
    assert(pdTRUE == semRet);
    (void)semRet; // avoid an unused-variable warning when asserts are compiled out
```

Other APIs that require this type of change are:

* [OTA_Shutdown](https://github.com/aws/ota-for-aws-iot-embedded-sdk/blob/c3bd5840979cadfe1f9505e13e49cccb87333650/source/ota.c#L3365-L3402) - requires taking a local copy of the state inside the semaphore/mutex lock and returning it outside the lock.
* [OTA_GetState](https://github.com/aws/ota-for-aws-iot-embedded-sdk/blob/c3bd5840979cadfe1f9505e13e49cccb87333650/source/ota.c#L3408-L3411) - requires taking a local copy of the state inside the semaphore/mutex lock and returning it outside the lock.
* [OTA_GetStatistics](https://github.com/aws/ota-for-aws-iot-embedded-sdk/blob/c3bd5840979cadfe1f9505e13e49cccb87333650/source/ota.c#L3422) - otherwise portions of the stats may not be correct relative to each other. I would suggest a separate lock for this.
* [OTA_ActivateNewImage](https://github.com/aws/ota-for-aws-iot-embedded-sdk/blob/c3bd5840979cadfe1f9505e13e49cccb87333650/source/ota.c#L3466-L3468) - requires taking a local copy of the function pointer (`?? activateFn = otaAgent.pOtaInterface->pal.activate`) and then calling it only if it is not NULL.
* [OTA_SetImageState](https://github.com/aws/ota-for-aws-iot-embedded-sdk/blob/c3bd5840979cadfe1f9505e13e49cccb87333650/source/ota.c#L3511-L3529) - required when `setImageStateWithReason()` is used.
* [OTA_GetImageState](https://github.com/aws/ota-for-aws-iot-embedded-sdk/blob/c3bd5840979cadfe1f9505e13e49cccb87333650/source/ota.c#L3549-L3552) - requires taking a local copy of the imageState within a lock.
* [OTA_Suspend](https://github.com/aws/ota-for-aws-iot-embedded-sdk/blob/c3bd5840979cadfe1f9505e13e49cccb87333650/source/ota.c#L3564-L3570) - should move that code into the action performed when the `OtaAgentEventSuspend` message is received by the `OTA_EventProcessingTask`.
* `OTA_Resume` - stopped here - you get the idea...
* `OTA_SignalEvent` - for the statistics and the read of state. The stats should probably have their own lock.
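The "local copy under the lock" changes listed above could be sketched roughly as follows. This is an illustration under assumptions, not the library's code: it uses POSIX mutexes so it can run on a host, and `SketchAgent_t`, `SketchStats_t`, and all `sketch*` functions are hypothetical stand-ins for `otaAgent` and the `OTA_*` getters.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the library's real types. */
typedef struct
{
    uint32_t packetsReceived;
    uint32_t packetsProcessed;
} SketchStats_t;

typedef int ( * SketchActivateFn_t )( void );

typedef struct
{
    pthread_mutex_t lock;      /* guards state and activate */
    pthread_mutex_t statsLock; /* separate lock for the statistics */
    int state;
    SketchActivateFn_t activate;
    SketchStats_t stats;
} SketchAgent_t;

static SketchAgent_t sketchAgent =
{
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER, 0, NULL, { 0U, 0U }
};

/* Placeholder activate callback, present only to exercise the sketch. */
static int sketchNoopActivate( void )
{
    return 0;
}

static void sketchLock( pthread_mutex_t * pLock )
{
    int rc = pthread_mutex_lock( pLock );
    assert( rc == 0 );
    ( void ) rc;
}

static void sketchUnlock( pthread_mutex_t * pLock )
{
    int rc = pthread_mutex_unlock( pLock );
    assert( rc == 0 );
    ( void ) rc;
}

/* Copy the state while holding the lock, return the copy after releasing it. */
static int sketchGetState( void )
{
    int localState;

    sketchLock( &sketchAgent.lock );
    localState = sketchAgent.state;
    sketchUnlock( &sketchAgent.lock );

    return localState;
}

/* Snapshot the function pointer under the lock, then call the snapshot.
 * Returns -1 if no callback is registered. */
static int sketchActivate( void )
{
    SketchActivateFn_t activateFn;

    sketchLock( &sketchAgent.lock );
    activateFn = sketchAgent.activate;
    sketchUnlock( &sketchAgent.lock );

    return ( activateFn != NULL ) ? activateFn() : -1;
}

/* Copy all counters in one critical section so they are mutually consistent. */
static void sketchGetStatistics( SketchStats_t * pOut )
{
    sketchLock( &sketchAgent.statsLock );
    *pOut = sketchAgent.stats;
    sketchUnlock( &sketchAgent.statsLock );
}
```

Calling the callback through the local snapshot keeps the lock from being held across a potentially long-running call, at the cost that the callback may be replaced by another task while the old one is still executing.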

APIs that look to be okay:
* `OTA_CheckForUpdate()`
* `OTA_Err_strerror()`
* `OTA_JobParse_strerror`
* `OTA_PalStatus_strerror`
* `OTA_OsStatus_strerror`

As mentioned, I did not check any of the handlers that are registered with the network task, but I assume they most likely have the same kind of issue.
 
**Host**
- Host OS: Linux - but this is ANY OS including FreeRTOS
- Version: Ubuntu 18.04

**To Reproduce**
- N/A - done by inspection, but this could be reproduced by running the code under ThreadSanitizer (clang) and reviewing the data races it reports.
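As a rough illustration of the kind of access ThreadSanitizer flags, here is a small host-side sketch using POSIX threads. It does not use the OTA library at all; `raceRun()`, `raceWorker()`, and `raceCounter` are hypothetical names standing in for two tasks updating a shared `otaAgent` field.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define RACE_ITERATIONS    100000

static pthread_mutex_t raceLock = PTHREAD_MUTEX_INITIALIZER;
static long raceCounter; /* shared state, stand-in for an otaAgent field */
static int raceUseLock;  /* set before the threads start, read-only afterwards */

static void * raceWorker( void * pArg )
{
    ( void ) pArg;

    for( int i = 0; i < RACE_ITERATIONS; i++ )
    {
        if( raceUseLock )
        {
            pthread_mutex_lock( &raceLock );
        }

        raceCounter++; /* read - modify - write on shared state */

        if( raceUseLock )
        {
            pthread_mutex_unlock( &raceLock );
        }
    }

    return NULL;
}

/* Runs two workers against the shared counter and returns its final value.
 * With useLock == 0 the increments are an intentional data race. */
static long raceRun( int useLock )
{
    pthread_t first;
    pthread_t second;
    int rc;

    raceCounter = 0;
    raceUseLock = useLock;

    rc = pthread_create( &first, NULL, raceWorker, NULL );
    assert( rc == 0 );
    rc = pthread_create( &second, NULL, raceWorker, NULL );
    assert( rc == 0 );

    rc = pthread_join( first, NULL );
    assert( rc == 0 );
    rc = pthread_join( second, NULL );
    assert( rc == 0 );
    ( void ) rc;

    return raceCounter;
}
```

Built with something like `clang -fsanitize=thread -g -pthread` and called from `main()`, `raceRun( 0 )` should cause ThreadSanitizer to report a data race on `raceCounter`, while `raceRun( 1 )` should run without a report.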

**Expected behavior**

See above. Every API call that reads or modifies the internal `otaAgent.*` structure, which is also used by other tasks, is expected to protect its access to those fields with a semaphore and/or mutex.

**Screenshots**

N/A

**Wireshark logs**

N/A

**Additional context**

N/A

