Ticket #334 (reopened defect)

Opened 12 months ago

Last modified 6 months ago

Seg fault in ncs_cpnd

Reported by: hafe
Owned by: mahesh
Priority: major
Milestone: PL 3.0.1
Component: CPSv
Version:
Keywords:
Cc:
patch waiting for maintainer: no

Description

Problem seen using 2.0.1

It seems that the shared memory info is not valid (i_offset is crazy) but used
anyway.

Any ideas?

Program terminated with signal 11, Segmentation fault.
#0 0x00002b8adf1353d1 in memcpy () from /lib64/libc.so.6
(gdb) bt
#0 0x00002b8adf1353d1 in memcpy () from /lib64/libc.so.6
#1 0x00002aabaaab4af8 in ?? ()
#2 0x000000000040f1c2 in cpnd_restart_shm_ckpt_free (cb=0x5386d0,
cp_node=0x2aaaabadf510) at ./cpnd_res.c:1251
#3 0x000000000041caf9 in cpnd_evt_proc_ckpt_destroy (cb=0x5386d0, evt=0xdfd2f0,
sinfo=0xdfd470) at ./cpnd_evt.c:4874
#4 0x00000000004121c4 in cpnd_process_evt (evt=0xdfd2e0) at ./cpnd_evt.c:337
#5 0x0000000000404b4d in cpnd_main_process (info=0x5386d0) at ./cpnd_init.c:590
#6 0x00002b8adefad143 in start_thread () from /lib64/libpthread.so.0
#7 0x00002b8adf180b8d in clone () from /lib64/libc.so.6
#8 0x0000000000000000 in ?? ()
(gdb) up
#1 0x00002aabaaab4af8 in ?? ()
(gdb)
#2 0x000000000040f1c2 in cpnd_restart_shm_ckpt_free (cb=0x5386d0,
cp_node=0x2aaaabadf510) at ./cpnd_res.c:1251
1251 ./cpnd_res.c: No such file or directory.
in ./cpnd_res.c

(gdb) info locals
ckpt_info = {ckpt_name = {length = 0, value = '\0' <repeats 255 times>}, ckpt_id
= 0, maxSections = 0, maxSecSize = 0, node_id = 0, offset = 0, client_bitmap
= 0, is_valid = 0, bm_offset = 0,
is_unlink = 0, is_close = 0, cpnd_rep_create = 0, is_first = 0, close_time =
0, next = 0}
ckpt_hdr = {num_ckpts = 1990}
rc = 1
i_offset = 4294966952
no_ckpts = 1990
(gdb) p cp_node
$1 = (CPND_CKPT_NODE *) 0x2aaaabadf510
(gdb) p *cp_node
$2 = {patnode = {bit = 9, left = 0x5cb740, right = 0x2aaaabadf510, key_info =
0x2aaaabadf530 "[a\017"}, ckpt_id = 1007963, ckpt_name = {length = 0,
value = "tappCkpt-399250", '\0' <repeats 240 times>}, create_attrib =
{creationFlags = 1, checkpointSize = 700, retentionDuration = 60000000000,
maxSections = 1, maxSectionSize = 700,
maxSectionIdSize = 256}, open_flags = 0, ckpt_lcl_ref_cnt = 0,
active_mds_dest = 570711172513473, is_active_exist = 1, replica_info = {n_secs =
0, mem_used = 0, open = {
type = NCS_OS_POSIX_SHM_REQ_OPEN, info = {open = {i_name = 0xc84300 'ÿ'
<repeats 32 times>, "5555", i_flags = 66, i_map_flags = 1, i_size = 1132,
i_offset = 0, o_addr = 0x2aaaabc01000,
o_fd = 1007, o_hdl = 0}, close = {i_hdl = 13124352, i_addr =
0x100000042, i_fd = 1132, i_size = 0}, unlink = {i_name = 0xc84300 'ÿ' <repeats
32 times>, "5555"}, read = {
i_hdl = 13124352, i_addr = 0x100000042, i_to_buff = 0x46c,
i_read_size = 0, i_offset = 0}, write = {i_hdl = 13124352, i_addr = 0x100000042,
i_from_buff = 0x46c, i_write_size = 0,
i_offset = 0}}}, shm_sec_mapping = 0x2aaaabadfa60, section_info =
0x0}, clist = 0x0, cpnd_dest_list = 0x2aaaabadfb10, cpnd_rep_create = 1,
is_unlink = 1, is_close = 0, ret_tmr = {
type = 0, tmr_id = 0x0, ckpt_id = 0, agent_dest = 0, lcl_sec_id = 0, uarg =
0, is_active = 0, write_type = 0}, is_restart = 0, is_ckpt_onscxb = 1, cur_state
= 0, oth_state = 0,
agent_dest_list = 0x0, close_time = 0, is_rdset = 0, offset = -1, node_name =
{length = 0, value = '\0' <repeats 255 times>}, is_cpa_created_ckpt_replica = 0,
evt_bckup_q = 0x0, cpa_sinfo = {
to_svc = 0, dest = 0, stype = MDS_SENDTYPE_SND, ctxt = {length = 0 '\0',
data = '\0' <repeats 11 times>}}, cpa_sinfo_flag = 0}
(gdb) p cp_node->offset
$3 = -1
(gdb) p cb->shm_addr.ckpt_addr
$4 = (void *) 0x2aaaaaab4c4c
(gdb) printf "%x\n",i_offset
fffffea8
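
The printed value is 2^32 - 344, which is what an offset of -1 (the value of cp_node->offset above) would give if it were multiplied by a 344-byte record size and kept in an unsigned 32-bit variable. A minimal sketch of that arithmetic follows. The 344-byte record size, the 32-bit unsigned type and the helper name slot_byte_offset are assumptions inferred from the dump, not taken from cpnd_res.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record size, inferred from 4294966952 == 2^32 - 344. */
#define ASSUMED_RECORD_SIZE 344u

/* Byte offset of slot `index`, computed the way an unchecked
 * `index * record_size` stored in a 32-bit unsigned variable would be.
 * A negative index wraps around instead of being rejected. */
uint32_t slot_byte_offset(int32_t index, uint32_t record_size)
{
    assert(record_size > 0);
    return (uint32_t)index * record_size;
}
```

Under these assumptions, slot_byte_offset(-1, ASSUMED_RECORD_SIZE) yields 4294966952 (0xfffffea8), matching the i_offset in the dump.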

Change History

Changed 12 months ago by mahesh

  • owner set to mahesh
  • status changed from new to assigned

Changed 7 months ago by hafe

Seen with 3.0.0

Changed 7 months ago by mahesh

  • status changed from assigned to closed
  • resolution set to not reproducible

Changed 7 months ago by hafe

  • status changed from closed to reopened
  • resolution not reproducible deleted
  • milestone PL 2.0.2 deleted

Reopened since the problem still exists in 3.0.0!

It is true that it is not reproducible with a simple use case, but we see the problem frequently as soon as we start using CKPT a bit. That does not mean we cannot do something about it!

The function cpnd_restart_shm_ckpt_free() has no error handling at all. For example, if cp_node->offset == SHM_INIT, which seems to be the case in this crash, nothing checks for it. Likewise, ckpt_hdr.num_ckpts can be decremented below zero without any check. The problem is that if this function starts to return error codes, no caller is prepared for that:

cpnd > grep -n cpnd_restart_shm_ckpt_free *.c
cpnd_evt.c:1218: cpnd_restart_shm_ckpt_free(cb,cp_node);
cpnd_evt.c:3937: cpnd_restart_shm_ckpt_free(cb,cp_node);
cpnd_proc.c:1496: cpnd_restart_shm_ckpt_free(cb,cp_node);
cpnd_proc.c:1531: cpnd_restart_shm_ckpt_free(cb,cp_node);
cpnd_proc.c:2401: cpnd_restart_shm_ckpt_free(cb,cp_node);
cpnd_proc.c:2454: cpnd_restart_shm_ckpt_free(cb,cp_node);
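
The missing checks could look roughly like the sketch below. It is not the actual cpnd code: the structures are cut-down, hypothetical stand-ins, the names shm_ckpt_free_checked and shm_free_rc_t are invented for illustration, and the value -1 for SHM_INIT is taken from the gdb dump. Invalid calls are turned into no-ops, so callers that ignore the return value stay safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHM_INIT (-1)   /* "no shared memory slot assigned" marker */

/* Hypothetical stand-ins for the real header and checkpoint node. */
typedef struct { int32_t num_ckpts; } ckpt_hdr_t;
typedef struct { int32_t offset; } ckpt_node_t;

typedef enum {
    SHM_FREE_OK = 0,
    SHM_FREE_NO_SLOT,   /* node never had (or already gave up) a slot */
    SHM_FREE_EMPTY      /* header count is already zero or negative */
} shm_free_rc_t;

/* Release the node's slot only if both the node and the header are sane;
 * otherwise leave everything untouched and report why. */
shm_free_rc_t shm_ckpt_free_checked(ckpt_hdr_t *hdr, ckpt_node_t *node)
{
    assert(hdr != NULL && node != NULL);

    if (node->offset == SHM_INIT)
        return SHM_FREE_NO_SLOT;
    if (hdr->num_ckpts <= 0)
        return SHM_FREE_EMPTY;

    hdr->num_ckpts--;
    node->offset = SHM_INIT;   /* a second call becomes a harmless no-op */
    return SHM_FREE_OK;
}
```

The six call sites could adopt the return code one at a time, for example by logging anything other than SHM_FREE_OK.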

Changed 7 months ago by hafe

  • milestone set to PL 3.0.1

Changed 6 months ago by anders

On behalf of lars.g.ekman@…:

cp_node->offset == SHM_INIT (-1) at the time of the call, which leads to the
crazy i_offset. Adding a test for this only exposed another problem: in
cpnd_evt_proc_ckpt_destroy there is an attempt to free memory not owned by
the 'freer' (according to leap). The memory structure in leap does
however record the "last" owner, and that was indeed the checkpoint!

The assumption is therefore a race condition where multiple threads
call cpnd_evt_proc_ckpt_destroy for the same checkpoint. This is also
consistent with the fact that the problem occurs only under very heavy load
on the checkpoint service.

A mutex lock was introduced in cpnd_evt_proc_ckpt_destroy and seems
to do the trick. This is however a bad solution, since data, not code, should
be protected by mutexes. The fix does seem sufficient for us at the
moment, but this ticket requires a better solution.
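
One way to protect the data rather than the code is to keep the lock inside the structure it guards and to record that teardown has already happened. The sketch below uses plain POSIX mutexes. The type guarded_ckpt_t and its functions are hypothetical and not part of cpnd or leap, and it assumes that the node's memory outlives every caller, a lifetime question the sketch does not address.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical checkpoint node: the lock lives next to the data it guards. */
typedef struct {
    pthread_mutex_t lock;
    bool destroyed;   /* set by the first successful destroy */
    int  shm_slot;    /* -1 once the shared memory slot is released */
} guarded_ckpt_t;

void guarded_ckpt_init(guarded_ckpt_t *c, int shm_slot)
{
    assert(c != NULL);
    pthread_mutex_init(&c->lock, NULL);
    c->destroyed = false;
    c->shm_slot = shm_slot;
}

/* Returns true only for the single caller that performs the teardown;
 * concurrent or repeated calls for the same node return false. */
bool guarded_ckpt_destroy(guarded_ckpt_t *c)
{
    bool first;

    assert(c != NULL);
    pthread_mutex_lock(&c->lock);
    first = !c->destroyed;
    if (first) {
        c->destroyed = true;
        c->shm_slot = -1;   /* release the slot exactly once */
    }
    pthread_mutex_unlock(&c->lock);
    return first;
}
```

With this shape, two threads racing to destroy the same checkpoint serialize on that checkpoint's own lock, while destroys of different checkpoints do not block each other.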
