Skip to content

Conversation

@rveerama1
Copy link
Contributor

@rveerama1 rveerama1 commented Apr 14, 2020

Minimal tests for TCP2 functionality added.

LCOV - code coverage report

Hit Total Coverage
Lines: 499 621 80.4%
Functions: 40 42 95.2 %
Branches: 230 405 56.8 %

@rveerama1 rveerama1 requested a review from ozhuraki April 14, 2020 15:07
@rveerama1 rveerama1 added area: Networking area: Tests Issues related to a particular existing or missing test labels Apr 14, 2020
@rveerama1 rveerama1 added this to the v2.3.0 milestone Apr 14, 2020
Copy link
Member

@jukkar jukkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rveerama1
Copy link
Contributor Author

/home/buildslave/src/github.com/zephyrproject-rtos/modules/hal/nxp/mcux/devices/MIMXRT1062/MIMXRT1062.h:39470:50: error: expected identifier before '(' token 39470 | #define SRC ((SRC_Type *)SRC_BASE) | ^ /home/buildslave/src/github.com/zephyrproject-rtos/zephyr/subsys/net/ip/tcp2_priv.h:71:2: note: in expansion of macro 'SRC' 71 | SRC = 1,

causing CI failure.

@pfalcon
Copy link
Contributor

pfalcon commented Apr 15, 2020

zephyrproject-rtos/modules/hal/nxp/mcux/devices/MIMXRT1062/MIMXRT1062.h:39470:50: error: expected identifier before '('

causing CI failure.

@MaureenHelm, are you aware of this?

@ozhuraki
Copy link
Contributor

@rveerama1

/home/buildslave/src/github.com/zephyrproject-rtos/modules/hal/nxp/mcux/devices/MIMXRT1062/MIMXRT1062.h:39470:50: error: expected identifier before '(' token 39470 | #define SRC ((SRC_Type *)SRC_BASE) | ^ /home/buildslave/src/github.com/zephyrproject-rtos/zephyr/subsys/net/ip/tcp2_priv.h:71:2: note: in expansion of macro 'SRC' 71 | SRC = 1,
causing CI failure.

Let's rename: SRC -> TCP_EP_SRC, DST -> TCP_EP_DST

@rveerama1
Copy link
Contributor Author

Let's rename: SRC -> TCP_EP_SRC, DST -> TCP_EP_DST

Ok.

@rveerama1 rveerama1 force-pushed the tcp2_tests branch 2 times, most recently from e5a2c14 to 2e6ab81 Compare April 15, 2020 10:19
Copy link
Contributor

@ozhuraki ozhuraki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rveerama1
Copy link
Contributor Author

I couldn't see these failures earlier. after "west update" something changed itseems. I could see test fails in my system too. Investigating it.

@rveerama1
Copy link
Contributor Author

something fishy with upstream, 3f8c83a was the last commit in my local branch where there are no issues with this PR.If I rebase it crashes. Have to bisect.

@rveerama1
Copy link
Contributor Author

after git bisect commit 38031dd causing crashes in this tests.

@rveerama1
Copy link
Contributor Author

If I add CONFIG_MEM_POOL_HEAP_BACKEND=n then it passes.

@jukkar
Copy link
Member

jukkar commented Apr 15, 2020

after git bisect commit 38031dd causing crashes in this tests.

@andyross any idea about this?

@andyross
Copy link
Contributor

Is it hitting a failing allocation where it used to pass or a crash inside or due to the allocator itself?

@andyross
Copy link
Contributor

Or for that matter: a succeeding allocation where it used to fail, thus exercising code paths that weren't hit before. There were a nontrivial number of spots in the test cases where code was doing some allocation, assuming that the heap would now be full, and then being surprised when the new heap was a little more flexible and returned memory.

@andyross
Copy link
Contributor

It occurs to me that another possibility is a genuine buffer overflow in the code under test. The older mem_pool would put all the heap block memory in a contiguous region and kept the metadata separate. The new code mixes them together in a more conventional architecture. But that means that if the buffers all contain mostly-ignored bytes (i.e. they're just network buffers that don't get inspected in the test), an overflow might go undetected in the old code where in the new implementation it will corrupt the heap.

@rveerama1
Copy link
Contributor Author

The older mem_pool would put all the heap block memory in a contiguous region and kept the metadata separate. The new code mixes them together in a more conventional architecture.

It looks like that. but need to investigate it.

E: Page fault at address 0xd9d38 (error code 0x0)
E: Linear address not present in page tables
E:   PDE: 0x000000000011f827 Writable, User, Execute Enabled
E:   PTE: Non-present
E: EAX: 0x00122d18, EBX: 0x00122d18, ECX: 0x00000000, EDX: 0x000d9d38
E: ESI: 0x00122d18, EDI: 0xffff6e04, EBP: 0x00127e78, ESP: 0x00127e74
E: EFLAGS: 0x00000093 CS: 0x0008 CR3: 0x00121020

@rveerama1
Copy link
Contributor Author

@jukkar @ozhuraki I added different patch which covers other fixes too. Please take a look.

@ozhuraki
Copy link
Contributor

@rveerama1 @jukkar

To me 754c268 might be fixing the symptom instead of the root cause.

If the error is just out of size access on the allocated IPv4 tcp_endpoint by comparing it with IPv6 endpoint, the fix would be just find a place and add a check to prevent this.

Could you please check, and if it is so, consider dropping 754c268.

If done like in 754c268, I would suggest just to embed the union tcp_endpoint into struct tcp.

@jukkar
Copy link
Member

jukkar commented Apr 16, 2020

I added different patch which covers other fixes too.

The NET_SOCKADDR_MAX_SIZE is the same as sizeof(struct sockaddr_in6) if IPv6 is enabled. It is not obvious at first glance how we are overwriting the memory in this case.

@rveerama1
Copy link
Contributor Author

trigger the CI, random kernel tests are failing.

@rveerama1 rveerama1 closed this Apr 16, 2020
@rveerama1 rveerama1 reopened this Apr 16, 2020
@rveerama1
Copy link
Contributor Author

To me 754c268 might be fixing the symptom instead of the root cause.

root cause is not only comparison, also allocation.
sizeof(union tcp_endpoint) is always 24 which is equal to "sizeof(struct sockaddr_in6)" even though IPv6 disabled.
creating a endpoint for IPv4 case allocates only 8 bytes "sizeof(struct sockaddr_in)".
union tcp_endpoint *ep = tcp_calloc(1, sizeof(struct sockaddr_in)); this kind of assignment is the root cause.
After these changes tests pass (doesn't crash).

@jukkar
Copy link
Member

jukkar commented Apr 16, 2020

creating a endpoint for IPv4 case allocates only 8 bytes "sizeof(struct sockaddr_in)".
union tcp_endpoint *ep = tcp_calloc(1, sizeof(struct sockaddr_in)); this kind of assignment is the root cause.

Indeed, good catch!

@ozhuraki
Copy link
Contributor

@rveerama1 @jukkar

root cause is not only comparison, also allocation.

This was the whole point, i.e. 8 bytes are needed for IPv4, 24 for IPv6.

The root cause is most likely missing a check in tcp_endpoint_cmp().

@jukkar
Copy link
Member

jukkar commented Apr 16, 2020

This was the whole point, i.e. 8 bytes are needed for IPv4, 24 for IPv6.

Sure, but the code only allocates 8 bytes, but the struct is 24 bytes long (if IPv6 is enabled), so we overwrite the memory area by 16 bytes.

@ozhuraki
Copy link
Contributor

ozhuraki commented Apr 16, 2020

@jukkar

Sure, but the code only allocates 8 bytes, but the struct is 24 bytes long (if IPv6 is enabled), so we overwrite the memory area by 16 bytes.

The code allocates 8 or 24, and the sa_family in the most generic member of this union is used to check/ensure we are working with IPv4 or IPV6.

@rveerama1 rveerama1 closed this Apr 16, 2020
@rveerama1 rveerama1 reopened this Apr 16, 2020
@jukkar
Copy link
Member

jukkar commented Apr 16, 2020

The code allocates 8 or 24, and the sa_family in the most generic member of this union is used to check/ensure we are working with IPv4 or IPV6.

Ok, so it seems we have some code that writes past the allocated entry.

@ozhuraki
Copy link
Contributor

@rveerama1 @jukkar

Ok, so it seems we have some code that writes past the allocated entry.

I tried to debug it, it's actually this line in net_tcp_connect():

-		conn->dst->sa = *remote_addr;
+		conn->dst->sin.sin_addr = ((struct sockaddr_in *)remote_addr)->sin_addr;

@andyross
Copy link
Contributor

Note that @npitre in #24249 (which I need to finish reviewing) has some optimizations to the new heap code, one of which adds earlier corruption detection. It looks like you probably found the spot at fault, but if you need help localizing it might help to try including that series.

@rveerama1
Copy link
Contributor Author

  • conn->dst->sa = *remote_addr;
    
  • conn->dst->sin.sin_addr = ((struct sockaddr_in *)remote_addr)->sin_addr;
    

thanks @ozhuraki. I dropped my patch. please submit your patch. This PR goes through CI after your fix gets into upstream.

@ozhuraki
Copy link
Contributor

@rveerama1

thanks @ozhuraki. I dropped my patch. please submit your patch.

#24438

len is ssize_t variable type, %zd should be there.

Signed-off-by: Ravi kumar Veeramally <[email protected]>
Some other part of Zephyr has similar defines (SRC)
causing build failures.

Signed-off-by: Ravi kumar Veeramally <[email protected]>
Minimal tests for TCP2 functionality added.

Signed-off-by: Ravi kumar Veeramally <[email protected]>
@jukkar jukkar merged commit 3b3c7e7 into zephyrproject-rtos:master Apr 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: Networking area: Tests Issues related to a particular existing or missing test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants