redis

Commit Graph

Author	SHA1	Message	Date
Binbin	804110a487	Allocate Lua VM code with jemalloc instead of libc, and count it used memory (#13133 ) ## Background 1. Currently Lua memory control does not pass through Redis's zmalloc.c. Redis maxmemory cannot limit memory problems caused by users abusing lua since these lua VM memory is not part of used_memory. 2. Since jemalloc is much better (fragmentation and speed), and also we know it and trust it. we are going to use jemalloc instead of libc to allocate the Lua VM code and count it used memory. ## Process: In this PR, we will use jemalloc in lua. 1. Create an arena for all lua vm (script and function), which is shared, in order to avoid blocking defragger. 2. Create a bound tcache for the lua VM, since the lua VM and the main thread are by default in the same tcache, and if there is no isolated tcache, lua may request memory from the tcache which has just been freed by main thread, and vice versa On the other hand, since lua vm might be release in bio thread, but tcache is not thread-safe, we need to recreate the tcache every time we recreate the lua vm. 3. Remove lua memory statistics from memory fragmentation statistics to avoid the effects of lua memory fragmentation ## Other Add the following new fields to `INFO DEBUG` (we may promote them to INFO MEMORY some day) 1. allocator_allocated_lua: total number of bytes allocated of lua arena 2. allocator_active_lua: total number of bytes in active pages allocated in lua arena 3. allocator_resident_lua: maximum number of bytes in physically resident data pages mapped in lua arena 4. allocator_frag_bytes_lua: fragment bytes in lua arena This is oranagra's idea, and i got some help from sundb. This solves the third point in #13102. --------- Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Oran Agra <oran@redislabs.com>	2024-04-16 12:43:33 +03:00
debing.sun	4581d43230	Fix daylight race condition and some thread leaks (#13191 ) fix some issues that come from sanitizer thread report. 1. when the main thread is updating daylight_active, other threads (bio, module thread) may be writing logs at the same time. ``` WARNING: ThreadSanitizer: data race (pid=661064) Read of size 4 at 0x55c9a4d11c70 by thread T2: #0 serverLogRaw /home/sundb/data/redis_fork/src/server.c:116 (redis-server+0x8d797) (BuildId: dca0b1945ba30010e36129bdb296e488dd2b32d0) #1 _serverLog.constprop.2 /home/sundb/data/redis_fork/src/server.c:146 (redis-server+0x2a3b14) (BuildId: dca0b1945ba30010e36129bdb296e488dd2b32d0) #2 bioProcessBackgroundJobs /home/sundb/data/redis_fork/src/bio.c:329 (redis-server+0x1c24ca) (BuildId: dca0b1945ba30010e36129bdb296e488dd2b32d0) Previous write of size 4 at 0x55c9a4d11c70 by main thread (mutexes: write M0, write M1, write M2, write M3): #0 updateCachedTimeWithUs /home/sundb/data/redis_fork/src/server.c:1102 (redis-server+0x925e7) (BuildId: dca0b1945ba30010e36129bdb296e488dd2b32d0) #1 updateCachedTimeWithUs /home/sundb/data/redis_fork/src/server.c:1087 (redis-server+0x925e7) #2 updateCachedTime /home/sundb/data/redis_fork/src/server.c:1118 (redis-server+0x925e7) #3 afterSleep /home/sundb/data/redis_fork/src/server.c:1811 (redis-server+0x925e7) #4 aeProcessEvents /home/sundb/data/redis_fork/src/ae.c:389 (redis-server+0x85ae0) (BuildId: dca0b1945ba30010e36129bdb296e488dd2b32d0) #5 aeProcessEvents /home/sundb/data/redis_fork/src/ae.c:342 (redis-server+0x85ae0) #6 aeMain /home/sundb/data/redis_fork/src/ae.c:477 (redis-server+0x85ae0) #7 main /home/sundb/data/redis_fork/src/server.c:7211 (redis-server+0x7168c) (BuildId: dca0b1945ba30010e36129bdb296e488dd2b32d0) ``` 2. thread leaks in module tests ``` WARNING: ThreadSanitizer: thread leak (pid=668683) Thread T13 (tid=670041, finished) created by main thread at: #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1036 (libtsan.so.2+0x3d179) (BuildId: 28a9f70061dbb2dfa2cef661d3b23aff4ea13536) #1 HelloBlockNoTracking_RedisCommand /home/sundb/data/redis_fork/tests/modules/blockonbackground.c:200 (blockonbackground.so+0x97fd) (BuildId: 9cd187906c57e88cdf896d121d1d96448b37a136) #2 HelloBlockNoTracking_RedisCommand /home/sundb/data/redis_fork/tests/modules/blockonbackground.c:169 (blockonbackground.so+0x97fd) #3 call /home/sundb/data/redis_fork/src/server.c:3546 (redis-server+0x9b7fb) (BuildId: dca0b1945ba30010e36129bdb296e488dd2b32d0) #4 processCommand /home/sundb/data/redis_fork/src/server.c:4176 (redis-server+0xa091c) (BuildId: dca0b1945ba30010e36129bdb296e488dd2b32d0) #5 processCommandAndResetClient /home/sundb/data/redis_fork/src/networking.c:2468 (redis-server+0xd2b8e) (BuildId: dca0b1945ba30010e36129bdb296e488dd2b32d0) #6 processInputBuffer /home/sundb/data/redis_fork/src/networking.c:2576 (redis-server+0xd2b8e) #7 readQueryFromClient /home/sundb/data/redis_fork/src/networking.c:2722 (redis-server+0xd358f) (BuildId: dca0b1945ba30010e36129bdb296e488dd2b32d0) #8 callHandler /home/sundb/data/redis_fork/src/connhelpers.h:58 (redis-server+0x288a7b) (BuildId: dca0b1945ba30010e36129bdb296e488dd2b32d0) #9 connSocketEventHandler /home/sundb/data/redis_fork/src/socket.c:277 (redis-server+0x288a7b) #10 aeProcessEvents /home/sundb/data/redis_fork/src/ae.c:417 (redis-server+0x85b45) (BuildId: dca0b1945ba30010e36129bdb296e488dd2b32d0) #11 aeProcessEvents /home/sundb/data/redis_fork/src/ae.c:342 (redis-server+0x85b45) #12 aeMain /home/sundb/data/redis_fork/src/ae.c:477 (redis-server+0x85b45) #13 main /home/sundb/data/redis_fork/src/server.c:7211 (redis-server+0x7168c) (BuildId: dca0b1945ba30010e36129bdb296e488dd2b32d0) ```	2024-04-04 13:49:51 +03:00
Pieter Cailliau	0b34396924	Change license from BSD-3 to dual RSALv2+SSPLv1 (#13157 ) [Read more about the license change here](https://redis.com/blog/redis-adopts-dual-source-available-licensing/) Live long and prosper 🖖	2024-03-20 22:38:24 +00:00
Binbin	e04d41d78d	Prevent lua error_reply abuse from causing errorstats to become larger (#13141 ) Users who abuse lua error_reply will generate a new error object on each error call, which can make server.errors get bigger and bigger. This will cause the server to block when calling INFO (we also return errorstats by default). To prevent the damage it can cause, when a misuse is detected, we will print a warning log and disable the errorstats to avoid adding more new errors. It can be re-enabled via CONFIG RESETSTAT. Because server.errors may be very large (it may be better now since we have the limit), config resetstat may block for a while. So in resetErrorTableStats, we will try to lazyfree server.errors. See the related discussion at the end of #8217.	2024-03-19 08:18:22 +02:00
Binbin	7b070423b8	Fix dictionary use-after-free in active expire and make kvstore iter to respect EMPTY flag (#13135 ) After #13072, there is an use-after-free error. In expireScanCallback, we will delete the dict, and then in dictScan we will continue to use the dict, like we will doing `dictResumeRehashing(d)` in the end, this casued an error. In this PR, in freeDictIfNeeded, if the dict's pauserehash is set, don't delete the dict yet, and then when scan returns try to delete it again. At the same time, we noticed that there will be similar problems in iterator. We may also delete elements during the iteration process, causing the dict to be deleted, so the part related to iter in the PR has also been modified. dictResetIterator was also missing from the previous kvstoreIteratorNextDict, we currently have no scenario that elements will be deleted in kvstoreIterator process, deal with it together to avoid future problems. Added some simple tests to verify the changes. In addition, the modification in #13072 omitted initTempDb and emptyDbAsync, and they were also added. This PR also remove the slow flag from the expire test (consumes 1.3s) so that problems can be found in CI in the future.	2024-03-18 17:41:54 +02:00
Binbin	3b3d16f748	Add KVSTORE_FREE_EMPTY_DICTS to cluster mode keys / expires kvstore (#13072 ) Currently (following #11695, and #12822), keys kvstore and expires kvstore both flag with ON_DEMAND, it means that a cluster node will only allocate a dict when the slot is assigned to it and populated, but on the other hand, when the slot is unassigned, the dict will remain allocated. We considered releasing the dict when the slot is unassigned, but it causes complications on replicas. On the other hand, from benchmarks we conducted, it looks like the performance impact of releasing the dict when it becomes empty and re-allocate it when a key is added again, isn't huge. This PR add KVSTORE_FREE_EMPTY_DICTS to cluster mode keys / expires kvstore. The impact is about about 2% performance drop, for this hopefully uncommon scenario. --------- Co-authored-by: Oran Agra <oran@redislabs.com>	2024-03-13 08:30:20 +02:00
Binbin	ad28d222ed	Lua eval scripts first in first out LRU eviction (#13108 ) In some cases, users will abuse lua eval. Each EVAL call generates a new lua script, which is added to the lua interpreter and cached to redis-server, consuming a large amount of memory over time. Since EVAL is mostly the one that abuses the lua cache, and these won't have pipeline issues (i.e. the script won't disappear unexpectedly, and cause errors like it would with SCRIPT LOAD and EVALSHA), we implement a plain FIFO LRU eviction only for these (not for scripts loaded with SCRIPT LOAD). ### Implementation notes: When not abused we'll probably have less than 100 scripts, and when abused we'll have many thousands. So we use a hard coded value of 500 scripts. And considering that we don't have many scripts, then unlike keys, we don't need to worry about the memory usage of keeping a true sorted LRU linked list. We compute the SHA of each script anyway, and put the script in a dict, we can store a listNode there, and use it for quick removal and re-insertion into an LRU list each time the script is used. ### New interfaces: At the same time, a new `evicted_scripts` field is added to INFO, which represents the number of evicted eval scripts. Users can check it to see if they are abusing EVAL. ### benchmark: `./src/redis-benchmark -P 10 -n 1000000 -r 10000000000 eval "return __rand_int__" 0` The simple abuse of eval benchmark test that will create 1 million EVAL scripts. The performance has been improved by 50%, and the max latency has dropped from 500ms to 13ms (this may be caused by table expansion inside Lua when the number of scripts is large). And in the INFO memory, it used to consume 120MB (server cache) + 310MB (lua engine), but now it only consumes 70KB (server cache) + 210KB (lua_engine) because of the scripts eviction. For non-abusive case of about 100 EVAL scripts, there's no noticeable change in performance or memory usage. ### unlikely potentially breaking change: in theory, a user can maybe load a script with EVAL and then use EVALSHA to call it (by calculating the SHA1 value on the client side), it could be that if we read the docs carefully we'll realized it's a valid scenario, but we suppose it's extremely rare. So it may happen that EVALSHA acts on a script created by EVAL, and the script is evicted and EVALSHA returns a NOSCRIPT error. that is if you have more than 500 scripts being used in the same transaction / pipeline. This solves the second point in #13102.	2024-03-13 08:27:41 +02:00
Chen Tianjie	4cae99e785	Add overhead of all DBs and rehashing dict count to info. (#12913 ) Sometimes we need to make fast judgement about why Redis is suddenly taking more memory. One of the reasons is main DB's dicts doing rehashing. We may use `MEMORY STATS` to monitor the overhead memory of each DB, but there still lacks a total sum to show an overall trend. So this PR adds the total overhead of all DBs to `INFO MEMORY` section, together with the total count of rehashing DB dicts, providing some intuitive metrics about main dicts rehashing. This PR adds the following metrics to INFO MEMORY * `mem_overhead_db_hashtable_rehashing` - only size of ht[0] in dictionaries we're rehashing (i.e. the memory that's gonna get released soon) and a similar ones to MEMORY STATS: * `overhead.db.hashtable.lut` (complements the existing `overhead.hashtable.main` and `overhead.hashtable.expires` which also counts the `dictEntry` structs too) * `overhead.db.hashtable.rehashing` - temporary rehashing overhead. * `db.dict.rehashing.count` - number of top level dictionaries being rehashed. --------- Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com> Co-authored-by: Oran Agra <oran@redislabs.com>	2024-03-01 13:41:24 +08:00
debing.sun	f6785df663	Defragger improvements around large bins (#12996 ) Implement #12963 ## Changes 1. large bins don't have external fragmentation or are at least non-defraggable, so we should ignore the effect of large bins when measuring fragmentation, and only measure fragmentation of small bins. this affects both the allocator_frag* metrics and also the active-defrag trigger 2. Adding INFO metrics for `muzzy` memory, which is memory returned to the OS but still shows as RSS until the OS reclaims it. --------- Co-authored-by: Oran Agra <oran@redislabs.com>	2024-02-20 18:11:09 +02:00
zhaozhao.zz	8876d264ac	Calculate the incremental rehash time more precisely (#13063 ) In the `databasesCron()`, the time consumed by `kvstoreIncrementallyRehash()` is used to calculate the exit condition. However, within `kvstoreIncrementallyRehash()`, the loop first checks for timeout before performing rehashing. Therefore, the time for the last rehash isn't accounted for, making the consumed time inaccurate. We need to precisely calculate all the time spent on rehashing. Additionally, the time allocated to `kvstoreIncrementallyRehash()` should be the remaining time, which is `INCREMENTAL_REHASHING_THRESHOLD_US` minus the already consumed `elapsed_us`.	2024-02-19 14:29:54 +02:00
Binbin	9103ccc398	AOF_FSYNC_EVERYSEC higher resolution, change aof_last_fsync and aof_flush_postponed_start to use mstime (#13041 ) Currently aof_last_fsync is using a low resolution unixtime is really bad, it checks if the absolute number of (full) seconds changed by one. depending on which side of the second barrier it falls, we can get very different results. This PR change the resolution to use milliseconds instead of complete seconds. In cases where the event loop cycle duration is short and their rapid (e.g. running many fast commands with short pipeline, or a high `hz` config), this change will not make much difference, since in anyway, we'll be quick to detect that we're on a "new second", and it's likely that these fsync will always be executed close to the second switch barrier. But in cases of rare or slow event loops cycles (e.g. either slow commands, or very low rate of traffic to redis, and low `hz`), it could easily be that with the old code, in some cases we'll have over 1.5 seconds between fsyncs, and in others less than 0.5. see discussion in #8612 This PR also handle aof_flush_postponed_start as well, the damage there is smaller since the threshold is 2 seconds, and not 1. --------- Co-authored-by: Oran Agra <oran@redislabs.com>	2024-02-18 12:08:29 +02:00
zhaozhao.zz	50d6fe8c4b	Add metrics for WATCH (#12966 ) Redis has some special commands that mark the client's state, such as `subscribe` and `blpop`, which mark the client as `CLIENT_PUBSUB` or `CLIENT_BLOCKED`, and we have metrics for the special use cases. However, there are also other special commands, like `WATCH`, which although do not have a specific flags, and should also be considered stateful client types. For stateful clients, in many scenarios, the connections cannot be shared in "connection pool", meaning connection pool cannot be used. For example, whenever the `WATCH` command is executed, a new connection is required to put the client into the "watch state" because the watched keys are stored in the client. If different business logic requires watching different keys, separate connections must be used; otherwise, there will be contamination. This also means that if a user's business heavily relies on the `WATCH` command, a large number of connections will be required. Recently we have encountered this situation in our platform, where some users consume a significant number of connections when using Redis because of `WATCH`. I hope we can have a way to observe these special use cases and special client connections. Here I add a few monitoring metrics: 1. `watching_clients` in `INFO` reply: The number of clients currently in the "watching" state. 2. `total_watched_keys` in `INFO` reply: The total number of keys being watched. 3. `watch` in `CLIENT LIST` reply: The number of keys each client is currently watching.	2024-02-18 10:36:41 +02:00
Binbin	493e31e3ad	Add new DEBUG dict-resizing command to disable the dict resize (#13043 ) The test fails here and there: ``` *** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl scan didn't handle slot skipping logic. ``` There are two case: 1. In the case of passing the test, we use child process to avoid the dict resize, but it can not completely limit it, since in the dictDelete we still have chance to trigger the resize (hit the force radio). The reason why our test passed before is because the expire dict is still in the rehashing process, so the dictDelete, the dictShrinkIfNeeded can not trigger the resize. 2. In the case of failing the test, the expire dict finished the rehashing, so the last dictDelete, the dictShrinkIfNeeded trigger the dict resize since it hit the force radio, so the skipping logic fail. This PR add a new DEBUG command to disbale the dict resize.	2024-02-08 16:39:58 +02:00
Binbin	13bd3643c2	Re-compute active_defrag_running after adjusting defrag configurations (#13020 ) Currently, once active defrag starts, we can not adjust active_defrag_running downwards. This is because active_defrag_running will be dynamically compute based on the fragmentation, we think we should not lower the effort when the fragmentation drops. However, we need to note that active_defrag_running will also be dynamically computed based on configurations. In this case, we are not respecting cycle-min or cycle-max. Some people may realize halfway through that defrag consumes a lot and want to adjust it. Previously we could only turn off activedefrag and then turn it on again to adjust active_defrag_running downwards. So in this PR, when a active defrag configuration change is made, we will re-compute it. These configuration items are: - active-defrag-cycle-min - active-defrag-cycle-max - active-defrag-threshold-upper	2024-02-06 13:39:07 +02:00
guybe7	8cd62f82ca	Refactor the per-slot dict-array db.c into a new kvstore data structure (#12822 ) # Description Gather most of the scattered `redisDb`-related code from the per-slot dict PR (#11695) and turn it to a new data structure, `kvstore`. i.e. it's a class that represents an array of dictionaries. # Motivation The main motivation is code cleanliness, the idea of using an array of dictionaries is very well-suited to becoming a self-contained data structure. This allowed cleaning some ugly code, among others: loops that run twice on the main dict and expires dict, and duplicate code for allocating and releasing this data structure. # Notes 1. This PR reverts the part of https://github.com/redis/redis/pull/12848 where the `rehashing` list is global (handling rehashing `dict`s is under the responsibility of `kvstore`, and should not be managed by the server) 2. This PR also replaces the type of `server.pubsubshard_channels` from `dict**` to `kvstore` (original PR: https://github.com/redis/redis/pull/12804). After that was done, server.pubsub_channels was also chosen to be a `kvstore` (with only one `dict`, which seems odd) just to make the code cleaner by making it the same type as `server.pubsubshard_channels`, see `pubsubtype.serverPubSubChannels` 3. the keys and expires kvstores are currenlty configured to allocate the individual dicts only when the first key is added (unlike before, in which they allocated them in advance), but they won't release them when the last key is deleted. Worth mentioning that due to the recent change the reply of DEBUG HTSTATS changed, in case no keys were ever added to the db. before: ``` 127.0.0.1:6379> DEBUG htstats 9 [Dictionary HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries [Expires HT] Hash table 0 stats (main hash table): No stats available for empty dictionaries ``` after: ``` 127.0.0.1:6379> DEBUG htstats 9 [Dictionary HT] [Expires HT] ```	2024-02-05 17:21:35 +02:00
Binbin	492021db95	Fix blocking commands timeout is reset due to re-processing command (#13004 ) In #11012, we will reprocess command when client is unblocked on keys, in some blocking commands, for example, in the XREADGROUP BLOCK scenario, because of the re-processing command, we will recalculate the block timeout, causing the blocking time to be reset. This commit add a new CLIENT_REPROCESSING_COMMAND clent flag, explicitly let the command know that it is being re-processed, later in blockForKeys we will not reset the timeout. Affected BLOCK cases: - list / zset / stream, added test cases for each. Unaffected cases: - module (never re-process the commands). - WAIT / WAITAOF (never re-process the commands). Fixes #12998.	2024-01-30 11:32:59 +02:00
Chen Tianjie	af7ceeb765	Optimize resizing hash table to resize not only non-empty dicts. (#12819 ) The function `tryResizeHashTables` only attempts to shrink the dicts that has keys (change from #11695), this was a serious problem until the change in #12850 since it meant if all keys are deleted, we won't shrink the dick. But still, both dictShrink and dictExpand may be blocked by a fork child process, therefore, the cron job needs to perform both dictShrink and dictExpand, for not just non-empty dicts, but all dicts in DBs. What this PR does: 1. Try to resize all dicts in DBs (not just non-empty ones, as it was since #12850) 2. handle both shrink and expand (not just shrink, as it was since forever) 3. Refactor some APIs about dict resizing (get rid of `htNeedsShrink` `htNeedsShrink` `dictShrinkToFit`, and expose `dictShrinkIfNeeded` `dictExpandIfNeeded` which already contains all the code of those functions we get rid of, to make APIs more neat) 4. In the `Don't rehash if redis has child process` test, now that cron would do resizing, we no longer need to write to DB after the child process got killed, and can wait for the cron to expand the hash table.	2024-01-29 21:02:07 +02:00
zhaozhao.zz	85a834bfa2	Revert multi OOM limit and add multi buffer limit (#12961 ) Fix #9926 , and introduce an alternative method to prevent abuse of transactions: 1. revert #5454 (which was blocking read-only transactions in OOM state), and break the tie of MULTI state memory usage and the server OOM state. Meaning that we'll limit the total memory a single client can queue, and do that unconditionally regardless of the server being OOM or not. 2. to prevent abuse of transactions, we use the `client-query-buffer-limit` to restrict the size of the transaction. Because the commands cached in the MULTI/EXEC queue have not been executed yet, so they are also considered a part of the "query buffer" in a broader sense. In other words, the commands in the MULTI queue and the `querybuf` of the client together constitute the "query buffer". When they exceed the limit, the connection will be disconnected. The reasoning is that it's sensible to sends a single command with a huge (1GB) argument, and it's sensible to sends a transaction with many small commands, but it's probably not common to sends a long transaction with many huge arguments (will consume a lot of memory before even being executed). If anyone runs into that, they can simply increase the `client-query-buffer-limit` config. P.S. To prevent DDoS attacks, unauthenticated clients have a separate hard limit. Their query buffer should not exceed a maximum of 1MB. In other words, if the query buffer of an unauthenticated client exceeds 1MB or the `client-query-buffer-limit` (if it is set to a value smaller than 1MB,), the connection will be disconnected.	2024-01-25 11:17:39 +02:00
Binbin	628c0dea1b	Some cleanups around function (#12940 ) This PR did some cleanups around function: - drop the comment about Libraries Ctx, since we do have comment in functionsLibCtx, no need to maintain multiple copies. - remove outdated comment about the dropped Library description. - remove unused desc and code vars in functionExtractLibMetaData. - fix engines_nemory typo, changed it to engines_memory. - remove outdated comment about FUNCTION CREATE and FUNCTION INFO, FUNCTION CREATE was renamed to FUNCTION LOAD. - Check in initServer whether the return of functionsInit is OK.	2024-01-23 14:26:33 +02:00
Yanqi Lv	b07174afc2	Change the threshold of dict expand, shrink and rehash (#12948 ) Before this change (most recently modified in https://github.com/redis/redis/pull/12850#discussion_r1421406393), The trigger for normal expand threshold was 100% utilization and the trigger for normal shrink threshold was 10% (HASHTABLE_MIN_FILL). While during fork (DICT_RESIZE_AVOID), when we want to avoid rehash, the trigger thresholds were multiplied by 5 (`dict_force_resize_ratio`), meaning 500% for expand and 2% (100/10/5) for shrink. However, in `dictRehash` (the incremental rehashing), the rehashing threshold for shrinking during fork (DICT_RESIZE_AVOID) was 20% by mistake. This meant that if a shrinking is triggered when `dict_can_resize` is `DICT_RESIZE_ENABLE` which the threshold is 10%, the rehashing can continue when `dict_can_resize` is `DICT_RESIZE_AVOID`. This would cause unwanted CopyOnWrite damage. It'll make sense to change the thresholds of the rehash trigger and the thresholds of the incremental rehashing the same, however, in one we compare the size of the hash table to the number of records, and in the other we compare the size of ht[0] to the size of ht[1], so the formula is not exactly the same. to make things easier we change all the thresholds to powers of 2, so the normal shrinking threshold is changed from 100/10 (i.e. 10%) to 100/8 (i.e. 12.5%), and we change the threshold during forks from 5 to 4, i.e. from 500% to 400% for expand, and from 2% (100/10/5) to 3.125% (100/8/4)	2024-01-19 17:00:43 +02:00
debing.sun	d0640029dc	Fix race condition issues between the main thread and module threads (#12817 ) Fix #12785 and other race condition issues. See the following isolated comments. The following report was obtained using SANITIZER thread. ```sh make SANITIZER=thread ./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate ``` 1. Fixed thread-safe issue in RM_UnblockClient() Related discussion: https://github.com/redis/redis/pull/12817#issuecomment-1831181220 * When blocking a client in a module using `RM_BlockClientOnKeys()` or `RM_BlockClientOnKeysWithFlags()` with a timeout_callback, calling RM_UnblockClient() in module threads can lead to race conditions in `updateStatsOnUnblock()`. - Introduced: Version: 6.2 PR: #7491 - Touch: `server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`, and `server.latency_events` - Harm Level: High Potentially corrupts the memory data of `cmd->latency_histogram`, `server.slowlog`, and `server.latency_events` - Solution: Differentiate whether the call to moduleBlockedClientTimedOut() comes from the module or the main thread. Since we can't know if RM_UnblockClient() comes from module threads, we always assume it does and let `updateStatsOnUnblock()` asynchronously update the unblock status. * When error reply is called in timeout_callback(), ctx is not thread-safe, eventually lead to race conditions in `afterErrorReply`. - Introduced: Version: 6.2 PR: #8217 - Touch `server.stat_total_error_replies`, `server.errors`, - Harm Level: High Potentially corrupts the memory data of `server.errors` - Solution: Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`, and asynchronously reply errors to the client. 2. Made RM_Reply() family API thread-safe Related discussion: https://github.com/redis/redis/pull/12817#discussion_r1408707239 Call chain: `RM_Reply()` -> `_addReplyToBufferOrList()` -> touch server.current_client - Introduced: Version: 7.2.0 PR: #12326 - Harm Level: None Since the module fake client won't have the `CLIENT_PUSHING` flag, even if we touch server.current_client, we can still exit after `c->flags & CLIENT_PUSHING`. - Solution Checking `c->flags & CLIENT_PUSHING` earlier. 3. Made freeClient() thread-safe Fix #12785 - Introduced: Version: 4.0 Commit: `3fcf959e60` - Harm Level: Moderate * Trigger assertion It happens when the module thread calls freeClient while the io-thread is in progress, which just triggers an assertion, and doesn't make any race condiaions. * Touch `server.current_client`, `server.stat_clients_type_memory`, and `clientMemUsageBucket->clients`. It happens between the main thread and the module threads, may cause data corruption. 1. Error reset `server.current_client` to NULL, but theoretically this won't happen, because the module has already reset `server.current_client` to old value before entering freeClient. 2. corrupts `clientMemUsageBucket->clients` in updateClientMemUsageAndBucket(). 3. Causes server.stat_clients_type_memory memory statistics to be inaccurate. - Solution: * No longer counts memory usage on fake clients, to avoid updating `server.stat_clients_type_memory` in freeClient. * No longer resetting `server.current_client` in unlinkClient, because the fake client won't be evicted or disconnected in the mid of the process. * Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is not a fake client. 4. Fixed free client args without GIL Related discussion: https://github.com/redis/redis/pull/12817#discussion_r1408706695 When freeing retained strings in the module thread (refcount decr), or using them in some way (refcount incr), we should do so while holding the GIL, otherwise, they might be simultaneously freed while the main thread is processing the unblock client state. - Introduced: Version: 6.2.0 PR: #8141 - Harm Level: Low Trigger assertion or double free or memory leak. - Solution: Documenting that module API users need to ensure any access to these retained strings is done with the GIL locked 5. Fix adding fake client to server.clients_pending_write It will incorrectly log the memory usage for the fake client. Related discussion: https://github.com/redis/redis/pull/12817#issuecomment-1851899163 - Introduced: Version: 4.0 Commit: `9b01b64430` - Harm Level: None Only result in NOP - Solution: * Don't add fake client into server.clients_pending_write * Add c->conn assertion for updateClientMemUsageAndBucket() and updateClientMemoryUsage() to avoid same issue in the future. So now it will be the responsibility of the caller of both of them to avoid passing in fake client. 6. Fix calling RM_BlockedClientMeasureTimeStart() and RM_BlockedClientMeasureTimeEnd() without GIL - Introduced: Version: 6.2 PR: #7491 - Harm Level: Low Causes inaccuracies in command latency histogram and slow logs, but does not corrupt memory. - Solution: Module API users, if know that non-thread-safe APIs will be used in multi-threading, need to take responsibility for protecting them with their own locks instead of the GIL, as using the GIL is too expensive. ### Other issue 1. RM_Yield is not thread-safe, fixed via #12905. ### Summarize 1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and `RM_Yield`, potentially preventing memory corruption, data disorder, or assertion. 2. Updated docs and module test to clarify module API users' responsibility for locking non-thread-safe APIs in multi-threading, such as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(), RM_RetainString(), and RM_HoldString(). ### About backpot to 7.2 1. The implement of (1) is not too satisfying, would like to get more eyes. 2. (2), (3) can be safely for backport 3. (4), (6) just modifying the module tests and updating the documentation, no need for a backpot. 4. (5) is harmless, no need for a backpot. --------- Co-authored-by: Oran Agra <oran@redislabs.com>	2024-01-19 15:12:49 +02:00
Binbin	14b1edfd99	Fix dict resize ratio checks, avoid precision loss from integer division (#12952 ) In the past we used integers to compare ratios, let us assume that we have the following data in expanding: ``` used / size > 5 `80 / 16 > 5` is false `81 / 16 > 5` is false `95 / 16 > 5` is false `96 / 16 > 5` is true ``` Because the integer result is rounded, our resize breaks the ratio constraint, this has existed since the beginning, which resulted in us not strictly following the ratio (shrink also has the same issue). This PR change it to multiplication to avoid floating point calculations.	2024-01-18 11:16:50 +02:00
Yanqi Lv	e2b7932b34	Shrink dict when deleting dictEntry (#12850 ) When we insert entries into dict, it may autonomously expand if needed. However, when we delete entries from dict, it doesn't shrink to the proper size. If there are few entries in a very large dict, it may cause huge waste of memory and inefficiency when iterating. The main keyspace dicts (keys and expires), are shrinked by cron (`tryResizeHashTables` calls `htNeedsResize` and `dictResize`), And some data structures such as zset and hash also do that (call `htNeedsResize`) right after a loop of calls to `dictDelete`, But many other dicts are completely missing that call (they can only expand). In this PR, we provide the ability to automatically shrink the dict when deleting. The conditions triggering the shrinking is the same as `htNeedsResize` used to have. i.e. we expand when we're over 100% utilization, and shrink when we're below 10% utilization. Additionally: * Add `dictPauseAutoResize` so that flows that do mass deletions, will only trigger shrinkage at the end. * Rename `dictResize` to `dictShrinkToFit` (same logic as it used to have, but better name describing it) * Rename `_dictExpand` to `_dictResize` (same logic as it used to have, but better name describing it) related to discussion https://github.com/redis/redis/pull/12819#discussion_r1409293878 --------- Co-authored-by: Oran Agra <oran@redislabs.com> Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>	2024-01-15 08:20:53 +02:00
Yanqi Lv	c452e414a8	Optimize performance when many clients [p\|s]unsubscribe simultaneously (#12838 ) I'm testing the performance of Pub/Sub command recently. I find if many clients unsubscribe or are killed simultaneously, Redis needs a long time to deal with it. In my experiment, I set 5000 clients and each client subscribes 100 channels. Then I call `client kill type pubsub` to simulate the situation where clients unsubscribe all channels at the same time and calculate the execution time. The result shows that it takes about 23s. I use the _perf_ and find that `listSearchKey` in `pubsubUnsubscribeChannel` costs more than 90% cpu time. I think we can optimize this situation. In this PR, I replace list with dict to track the clients subscribing the channel more efficiently. It changes O(N) to O(1) in the search phase. Then I repeat the experiment as above. The results are as follows. \| \| Execution Time(s) \|used_memory(MB) \| \| :---------------- \| :------: \| :----: \| \| unstable(`1bd0b54`) \| 23.734 \| 65.41 \| \| optimize-pubsub \| 0.288 \| 67.66 \| Thanks for #11595 , I use a no-value dict and the results shows that the performance improves significantly but the memory usage only increases slightly. Notice: - This PR will cause the performance degradation about 20% in `[p\|s]subscribe` command but won't freeze Redis.	2024-01-08 10:32:31 +02:00
debing.sun	ca1f67af80	Make RM_Yield thread-safe (#12905 ) ## Issues and solutions from #12817 1. Touch ProcessingEventsWhileBlocked and calling moduleCount() without GIL in afterSleep() - Introduced: Version: 7.0.0 PR: #9963 - Harm Level: Very High If the module thread calls `RM_Yield()` before the main thread enters afterSleep(), and modifies `ProcessingEventsWhileBlocked`(+1), it will cause the main thread to not wait for GIL, which can lead to all kinds of unforeseen problems, including memory data corruption. - Initial / Abandoned Solution: * Added `__thread` specifier for ProcessingEventsWhileBlocked. `ProcessingEventsWhileBlocked` is used to protect against nested event processing, but event processing in the main thread and module threads should be completely independent and unaffected, so it is safer to use TLS. * Adding a cached module count to keep track of the current number of modules, to avoid having to use `dictSize()`. - Related Warnings: ``` WARNING: ThreadSanitizer: data race (pid=1136) Write of size 4 at 0x0001045990c0 by thread T4 (mutexes: write M0): #0 processEventsWhileBlocked networking.c:4135 (redis-server:arm64+0x10006d124) #1 RM_Yield module.c:2410 (redis-server:arm64+0x10018b66c) #2 bg_call_worker <null>:83232836 (blockedclient.so:arm64+0x16a8) Previous read of size 4 at 0x0001045990c0 by main thread: #0 afterSleep server.c:1861 (redis-server:arm64+0x100024f98) #1 aeProcessEvents ae.c:408 (redis-server:arm64+0x10000fd64) #2 aeMain ae.c:496 (redis-server:arm64+0x100010f0c) #3 main server.c:7220 (redis-server:arm64+0x10003f38c) ``` 2. aeApiPoll() is not thread-safe When using RM_Yield to handle events in a module thread, if the main thread has not yet entered `afterSleep()`, both the module thread and the main thread may touch `server.el` at the same time. - Introduced: Version: 7.0.0 PR: #9963 - Old / Abandoned Solution: Adding a new mutex to protect timing between after beforeSleep() and before afterSleep(). Defect: If the main thread enters the ae loop without any IO events, it will wait until the next timeout or until there is any event again, and the module thread will always hang until the main thread leaves the event loop. - Related Warnings: ``` SUMMARY: ThreadSanitizer: data race ae_kqueue.c:55 in addEventMask ================== ================== WARNING: ThreadSanitizer: data race (pid=14682) Write of size 4 at 0x000100b54000 by thread T9 (mutexes: write M0): #0 aeApiPoll ae_kqueue.c:175 (redis-server:arm64+0x100010588) #1 aeProcessEvents ae.c:399 (redis-server:arm64+0x10000fb84) #2 processEventsWhileBlocked networking.c:4138 (redis-server:arm64+0x10006d3c4) #3 RM_Yield module.c:2410 (redis-server:arm64+0x10018b66c) #4 bg_call_worker <null>:16042052 (blockedclient.so:arm64+0x169c) Previous write of size 4 at 0x000100b54000 by main thread: #0 aeApiPoll ae_kqueue.c:175 (redis-server:arm64+0x100010588) #1 aeProcessEvents ae.c:399 (redis-server:arm64+0x10000fb84) #2 aeMain ae.c:496 (redis-server:arm64+0x100010da8) #3 main server.c:7238 (redis-server:arm64+0x10003f51c) ``` ## The final fix as the comments: https://github.com/redis/redis/pull/12817#discussion_r1436427232 Optimized solution based on the above comment: First, we add `module_gil_acquring` to indicate whether the main thread is currently in the acquiring GIL state. When the module thread starts to yield, there are two possibilities(we assume the caller keeps the GIL): 1. The main thread is in the mid of beforeSleep() and afterSleep(), that is, `module_gil_acquring` is not 1 now. At this point, the module thread will wake up the main thread through the pipe and leave the yield, waiting for the next yield when the main thread may already in the acquiring GIL state. 2. The main thread is in the acquiring GIL state. The module thread release the GIL, yielding CPU to give the main thread an opportunity to start event processing, and then acquire the GIL again until the main thread releases it. This is what https://github.com/redis/redis/pull/12817#discussion_r1436427232 mentioned direction. --------- Co-authored-by: Oran Agra <oran@redislabs.com>	2024-01-07 12:10:29 +02:00
guybe7	12b611b374	WAITAOF: Try to wake blocked clients ASAP in the next beforeSleep (#12627 ) In case server.fsynced_reploff changed (e.g. flushAppendOnly set it to server.master_repl_offset in case there was nothing to fsync) we want to avoid sleeping before the next beforeSleep so we can call blockedBeforeSleep ASAP. without that, in case there's no incoming traffic, we could be waiting for the next cron timer event to wake us up.	2023-12-28 11:27:58 +02:00
Chen Tianjie	8527959598	Replace slots_to_channels radix tree with slot specific dictionaries for shard channels. (#12804 ) We have achieved replacing `slots_to_keys` radix tree with key->slot linked list (#9356), and then replacing the list with slot specific dictionaries for keys (#11695). Shard channels behave just like keys in many ways, and we also need a slots->channels mapping. Currently this is still done by using a radix tree. So we should split `server.pubsubshard_channels` into 16384 dicts and drop the radix tree, just like what we did to DBs. Some benefits (basically the benefits of what we've done to DBs): 1. Optimize counting channels in a slot. This is currently used only in removing channels in a slot. But this is potentially more useful: sometimes we need to know how many channels there are in a specific slot when doing slot migration. Counting is now implemented by traversing the radix tree, and with this PR it will be as simple as calling `dictSize`, from O(n) to O(1). 2. The radix tree in the cluster has been removed. The shard channel names no longer require additional storage, which can save memory. 3. Potentially useful in slot migration, as shard channels are logically split by slots, thus making it easier to migrate, remove or add as a whole. 4. Avoid rehashing a big dict when there is a large number of channels. Drawbacks: 1. Takes more memory than using radix tree when there are relatively few shard channels. What this PR does: 1. in cluster mode, split `server.pubsubshard_channels` into 16384 dicts, in standalone mode, still use only one dict. 2. drop the `slots_to_channels` radix tree. 3. to save memory (to solve the drawback above), all 16384 dicts are created lazily, which means only when a channel is about to be inserted to the dict will the dict be initialized, and when all channels are deleted, the dict would delete itself. 5. use `server.shard_channel_count` to keep track of the number of all shard channels. --------- Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2023-12-27 17:40:45 +08:00
Binbin	23e980e77a	Move cliVersion to cli_common and add --version support for redis-check-aof (#10856 ) Let us see which version of redis this tool is part of. Similarly to redis-cli, redis-benchmark and redis-check-rdb redis-rdb-check and redis-aof-check are actually symlinks to redis, so they will directly use getVersion in server, the format became: ``` {title} v={redis_version} sha={sha}:{dirty} malloc={malloc} bits={bits} build={build} ``` Move cliVersion into cli_common, redis-cli and redis-benchmark will use it, and the format is not change: ``` {title} {redis_version} (git:{sha}) ```	2023-12-21 13:51:46 +02:00
zhaozhao.zz	d8a21c5767	Unified db rehash method for both standalone and cluster (#12848 ) After #11695, we added two functions `rehashingStarted` and `rehashingCompleted` to the dict structure. We also registered two handlers for the main database's dict and expire structures. This allows the main database to record the dict in `rehashing` list when rehashing starts. Later, in `serverCron`, the `incrementallyRehash` function is continuously called to perform the rehashing operation. However, currently, when rehashing is completed, `rehashingCompleted` does not remove the dict from the `rehashing` list. This results in the `rehashing` list containing many invalid dicts. Although subsequent cron checks and removes dicts that don't require rehashing, it is still inefficient. This PR implements the functionality to remove the dict from the `rehashing` list in `rehashingCompleted`. This is achieved by adding `metadata` to the dict structure, which keeps track of its position in the `rehashing` list, allowing for quick removal. This approach avoids storing duplicate dicts in the `rehashing` list. Additionally, there are other modifications: 1. Whether in standalone or cluster mode, the dict in database is inserted into the rehashing linked list when rehashing starts. This eliminates the need to distinguish between standalone and cluster mode in `incrementallyRehash`. The function only needs to focus on the dicts in the `rehashing` list that require rehashing. 2. `rehashing` list is moved from per-database to Redis server level. This decouples `incrementallyRehash` from the database ID, and in standalone mode, there is no need to iterate over all databases, avoiding unnecessary access to databases that do not require rehashing. In the future, even if unsharded-cluster mode supports multiple databases, there will be no risk involved. 3. The insertion and removal operations of dict structures in the `rehashing` list are decoupled from `activerehashing` config. `activerehashing` only controls whether `incrementallyRehash` is executed in serverCron. There is no need for additional steps when modifying the `activerehashing` switch, as in #12705.	2023-12-15 10:42:53 +08:00
Guillaume Koenig	967fb3c6e8	Extend rax usage by allowing any long long value (#12837 ) The raxFind implementation uses a special pointer value (the address of a static string) as the "not found" value. It works as long as actual pointers were used. However we've seen usages where long long, non-pointer values have been used. It creates a risk that one of the long long value precisely is the address of the special "not found" value. This commit changes raxFind to return 1 or 0 to indicate elementhood, and take in a new void **value to optionally return the associated value. By extension, this also allow the RedisModule_DictSet/Replace operations to also safely insert integers instead of just pointers.	2023-12-14 14:50:18 -08:00
Chen Tianjie	f9cc25c1dd	Add metric to INFO CLIENTS: pubsub_clients. (#12849 ) In INFO CLIENTS section, we already have blocked_clients and tracking_clients. We should add a new metric showing the number of pubsub connections, which helps performance monitoring and trouble shooting.	2023-12-13 13:44:13 +08:00
Binbin	62419c01db	Handle missing fields in dbSwapDatabases and swapMainDbWithTempDb (#12763 ) The change in dbSwapDatabases seems harmless. Because in non-clustered mode, dbBuckets calculations are strictly accurate and in cluster mode, we only have one DB. Modify it for uniformity (just like resize_cursor). The change in swapMainDbWithTempDb is needed in case we swap with the temp db, otherwise the overhead memory usage of db can be miscalculated. In addition we will swap all fields (including rehashing list), just for completeness (and reduce the chance of surprises in the future). Introduced in #12697.	2023-12-10 17:30:20 +02:00
zhaozhao.zz	9ee1cc33a3	Make the sampling logic in eviction clearer (#12781 ) Additional optimizations for the eviction logic in #11695: To make the eviction logic clearer and decouple the number of sampled keys from the running mode (cluster or standalone). * When sampling in each database, we only care about the number of keys in the current database (not the dicts we sampled from). * If there are a insufficient number of keys in the current database (e.g. 10 times the value of `maxmemory_samples`), we can break out sooner (to avoid looping on a sparse database). * We'll never try to sample the db dicts more times than the number of non-empty dicts in the db (max 1 in non-cluster mode). And it also ensures that each database has a sufficient amount of sampled keys, so even if unsharded-cluster supports multiple databases, there won't be any issues. other changes: 1. keep track of the number of non-empty dicts in each database. 2. move key_count tracking into cumulativeKeyCountAdd rather than all it's callers --------- Co-authored-by: Oran Agra <oran@redislabs.com>	2023-12-06 09:37:24 +08:00
zhaozhao.zz	a1c5171c1d	Fix resize hash tables stuck on the last non-empty slot (#12802 ) Introduced in #11695 . The tryResizeHashTables function gets stuck on the last non-empty slot while iterating through dictionaries. It does not restart from the beginning. The reason for this issue is a problem with the usage of dbIteratorNextDict: /* Returns next dictionary from the iterator, or NULL if iteration is complete. / dict dbIteratorNextDict(dbIterator *dbit) { if (dbit->next_slot == -1) return NULL; dbit->slot = dbit->next_slot; dbit->next_slot = dbGetNextNonEmptySlot(dbit->db, dbit->slot, dbit->keyType); return dbGetDictFromIterator(dbit); } When iterating to the last non-empty slot, next_slot is set to -1, causing it to loop indefinitely on that slot. We need to modify the code to ensure that after iterating to the last non-empty slot, it returns to the first non-empty slot. BTW, function tryResizeHashTables is actually iterating over slots that have keys. However, in its implementation, it leverages the dbIterator (which is a key iterator) to obtain slot and dictionary information. While this approach works fine, but it is not very intuitive. This PR also improves readability by changing the iteration to directly iterate over slots, thereby enhancing clarity.	2023-11-28 18:50:16 +08:00
meiravgri	2e854bccc6	Fix async safety in signal handlers (#12658 ) see discussion from after https://github.com/redis/redis/pull/12453 was merged ---- This PR replaces signals that are not considered async-signal-safe (AS-safe) with safe calls. #### 1. serverLog() and serverLogFromHandler() `serverLog` uses unsafe calls. It was decided that we will avoid `serverLog` calls by the signal handlers when: * The signal is not fatal, such as SIGALRM. In these cases, we prefer using `serverLogFromHandler` which is the safe version of `serverLog`. Note they have different prompts: `serverLog`: `62220:M 26 Oct 2023 14:39:04.526 # <msg>` `serverLogFromHandler`: `62220:signal-handler (1698331136) <msg>` * The code was added recently. Calls to `serverLog` by the signal handler have been there ever since Redis exists and it hasn't caused problems so far. To avoid regression, from now we should use `serverLogFromHandler` #### 2. `snprintf` `fgets` and `strtoul`(base = 16) --------> `_safe_snprintf`, `fgets_async_signal_safe`, `string_to_hex` The safe version of `snprintf` was taken from [here](`8cfc4ca5e7/src/mc_util.c (L754)`) #### 3. fopen(), fgets(), fclose() --------> open(), read(), close() #### 4. opendir(), readdir(), closedir() --------> open(), syscall(SYS_getdents64), close() #### 5. Threads_mngr sync mechanisms * waiting for the thread to generate stack trace: semaphore --------> busy-wait * `globals_rw_lock` was removed: as we are not using malloc and the semaphore anymore we don't need to protect `ThreadsManager_cleanups`. #### 6. Stacktraces buffer The initial problem was that we were not able to safely call malloc within the signal handler. To solve that we created a buffer on the stack of `writeStacktraces` and saved it in a global pointer, assuming that under normal circumstances, the function `writeStacktraces` would complete before any thread attempted to write to it. However, if threads lag behind, they might access this global pointer after it no longer belongs to the `writeStacktraces` stack, potentially corrupting memory. To address this, various solutions were discussed [here](https://github.com/redis/redis/pull/12658#discussion_r1390442896) Eventually, we decided to create a pipe at server startup that will remain valid as long as the process is alive. We chose this solution due to its minimal memory usage, and since `write()` and `read()` are atomic operations. It ensures that stack traces from different threads won't mix. The stacktraces collection process is now as follows: * Cleaning the pipe to eliminate writes of late threads from previous runs. * Each thread writes to the pipe its stacktrace * Waiting for all the threads to mark completion or until a timeout (2 sec) is reached * Reading from the pipe to print the stacktraces. #### 7. Changes that were considered and eventually were dropped * replace watchdog timer with a POSIX timer: according to [settimer man](https://linux.die.net/man/2/setitimer) > POSIX.1-2008 marks getitimer() and setitimer() obsolete, recommending the use of the POSIX timers API ([timer_gettime](https://linux.die.net/man/2/timer_gettime)(2), [timer_settime](https://linux.die.net/man/2/timer_settime)(2), etc.) instead. However, although it is supposed to conform to POSIX std, POSIX timers API is not supported on Mac. You can take a look here at the Linux implementation: [here](`c7562ee135`) To avoid messing up the code, and uncertainty regarding compatibility, it was decided to drop it for now. * avoid using sds (uses malloc) in logConfigDebugInfo It was considered to print config info instead of using sds, however apparently, `logConfigDebugInfo` does more than just print the sds, so it was decided this fix is out of this issue scope. #### 8. fix Signal mask check The check `signum & sig_mask` intended to indicate whether the signal is blocked by the thread was incorrect. Actually, the bit position in the signal mask corresponds to the signal number. We fixed this by changing the condition to: `sig_mask & (1L << (sig_num - 1))` #### 9. Unrelated changes both `fork.tcl `and `util.tcl` implemented a function called `count_log_message` expecting different parameters. This caused confusion when trying to run daily tests with additional test parameters to run a specific test. The `count_log_message` in `fork.tcl` was removed and the calls were replaced with calls to `count_log_message` located in `util.tcl` --------- Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com> Co-authored-by: Oran Agra <oran@redislabs.com>	2023-11-23 13:22:20 +02:00
Josh Hershberg	290f376429	Cluster refactor: fn renames + small compilation issue on ubuntu Signed-off-by: Josh Hershberg <yehoshua@redis.com>	2023-11-22 05:54:06 +02:00
Josh Hershberg	98a6c44b75	Cluster refactor: Make clusterState private Move clusterState into cluster_legacy.h. In order to achieve this some "accessor" methods needed to be added to the cluster API and some other minor refactors. Signed-off-by: Josh Hershberg <yehoshua@redis.com>	2023-11-22 05:44:10 +02:00
Binbin	fe36306340	Fix DB iterator not resetting pauserehash causing dict being unable to rehash (#12757 ) When using DB iterator, it will use dictInitSafeIterator to init a old safe dict iterator. When dbIteratorNext is used, it will jump to the next slot db dict when we are done a dict. During this process, we do not have any calls to dictResumeRehashing, which causes the dict's pauserehash to always be > 0. And at last, it will be returned directly in dictRehashMilliseconds, which causes us to have slot dict in a state where rehash cannot be completed. In the "expire scan should skip dictionaries with lot's of empty buckets" test, adding a `keys ` can reproduce the problem stably. `keys ` will call dbIteratorNext to trigger a traversal of all slot dicts. Added dbReleaseIterator and dbIteratorInitNextSafeIterator methods to call dictResetIterator. Issue was introduced in #11695.	2023-11-14 14:28:46 +02:00
zhaozhao.zz	cf6ed3feeb	fix the wrong judgement for activerehashing in standalone mode (#12741 ) Introduced by #11695, the judgement should be dictIsRehashing.	2023-11-09 11:30:50 +02:00
Chen Tianjie	e9f312e087	Change stat_client_qbuf_limit_disconnections to atomic. (#12711 ) In #12476 server.stat_client_qbuf_limit_disconnections was added. It is written in readQueryFromClient, which may be called by multiple threads when io-threads and io-threads-do-reads are turned on. Somehow we missed to make it an atomic variable.	2023-11-01 10:57:24 +08:00
Harkrishn Patro	4145d628b4	Reduce dbBuckets operation time complexity from O(N) to O(1) (#12697 ) As part of #11695 independent dictionaries were introduced per slot. Time complexity to discover total no. of buckets across all dictionaries increased to O(N) with straightforward implementation of iterating over all dictionaries and adding dictBuckets of each. To optimize the time complexity, we could maintain a global counter at db level to keep track of the count of buckets and update it on the start and end of rehashing. --------- Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>	2023-10-27 22:05:40 +03:00
Harkrishn Patro	f3bf8485d8	Fix resize hash table dictionary iterator (#12660 ) Dictionary iterator logic in the `tryResizeHashTables` method is picking the next (incorrect) dictionary while the cursor is at a given slot. This could lead to some dictionary/slot getting skipped from resizing. Also stabilize the test. problem introduced recently in #11695	2023-10-19 13:58:32 +03:00
Vitaly	0270abda82	Replace cluster metadata with slot specific dictionaries (#11695 ) This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data. ## Important changes * Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms. * getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time. * Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree. * scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has interesting side effect, now you'll be able to start scanning specific slot by simply providing slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth nothing the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue. * Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple multiple times, hence we are relying on cached slot id in the client during the command executions. All operations that access random keys, either should pass in the known slot or recompute the slot. * Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading. * DB size - besides `DBSIZE` API, we need to know size of the DB in many places want, in order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep DBSIZE operation O(1). This is also kept for O(1) expires computation as well. ## Performance This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict. RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load. ## Interface changes * Removed `overhead.hashtable.slot-to-keys` to `MEMORY STATS` * Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored. * New RDB version to support the new op code for SLOT information. --------- Co-authored-by: Vitaly Arbuzov <arvit@amazon.com> Co-authored-by: Harkrishn Patro <harkrisp@amazon.com> Co-authored-by: Roshan Khatri <rvkhatri@amazon.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Oran Agra <oran@redislabs.com>	2023-10-14 23:58:26 -07:00
Madelyn Olson	31c3172d9b	Better standardize around assertions (#12539 ) We use the C standard assert() in various places in the codebase, which requires NDEBUG to be undefined. We introduced the redisassert.h file in order to allow low level files to access the assert that maps to serverPanic, but this was only applied tactically and is not available broadly. This PR removes all usage of the standard library asserts and replaces them with an assert that maps to serverPanic. It makes us immune to accidentally setting the NDEBUG flag preventing assertions. I also marked marked the server asserts as "likely" to not execute. I spot checked various points in the code, and it didn't change the code layout on my x86 mac, but it is more consistent with redisassert.h and seems more correct overall.	2023-10-02 18:58:44 -07:00
Viktor Söderqvist	f924bebd83	Rewrite huge printf calls to smaller ones for readability (#12257 ) In a long printf call with many placeholders, it's hard to see which argument belongs to which placeholder. The long printf-like calls in the INFO and CLIENT commands are rewritten into pairs of (format, argument). These pairs are then rewritten to a single call with a long format string and a long list of arguments, using a macro called FMTARGS. The file `fmtargs.h` is added to the repo. Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>	2023-09-28 09:21:23 +03:00
meiravgri	cc2be63997	Print stack trace from all threads in crash report (#12453 ) In this PR we are adding the functionality to collect all the process's threads' backtraces. ## Changes made in this PR ### introduce threads mngr API The threads mngr API which has 2 abilities: * `ThreadsManager_init() `- register to SIGUSR2. called on the server start-up. * ` ThreadsManager_runOnThreads()` - receives a list of a pid_t and a callback, tells every thread in the list to invoke the callback, and returns the output collected by each invocation. Elaborating atomicvar API * `atomicIncrGet(var,newvalue_var,count) `-- Increment and get the atomic counter new value * `atomicFlagGetSet` -- Get and set the atomic counter value to 1 ### Always set SIGALRM handler SIGALRM handler prints the process's stacktrace to the log file. Up until now, it was set only if the `server.watchdog_period` > 0. This can be also useful if debugging is needed. However, in situations where the server can't get requests, (a deadlock, for example) we weren't able to change the signal handler. To make it available at run time we set SIGALRM handler on server startup. The signal handler name was changed to a more general `sigalrmSignalHandler`. ### Print all the process' threads' stacktraces `logStackTrace()` now calls `writeStacktraces()`, instead of logging the current thread stacktrace. `writeStacktraces()`: * On Linux systems we use the threads manager API to collect the backtraces of all the process' threads. To get the `tids` list (threads ids) we read the `/proc/<redis-server-pid>/tasks` file which includes a list of directories. Each directory name corresponds to one tid (including the main thread). For each thread, we also need to check if it can get the signal from the threads manager (meaning it is not blocking/ignoring that signal). We send the threads manager this tids list and `collect_stacktrace_data()` callback, which collects the thread's backtrace addresses, its name, and tid. * On other systems, the behavior remained as it was (writing only the current thread stacktrace to the log file). ## compatibility notes 1. The threads mngr API is only supported in linux. 2. glibc earlier than 2.3 We use `syscall(SYS_gettid)` and `syscall(SYS_tgkill...)` because their dedicated alternatives (`gettid()` and `tgkill`) were added in glibc 2.3. ## Output example Each thread backtrace will have the following format: `<tid> <thread_name> [additional_info]` * tid: as read from the `/proc/<redis-server-pid>/tasks` file * thread_name: the tread name as it is registered in the os/ * additional_info: Sometimes we want to add specific information about one of the threads. currently. it is only used to mark the thread that handles the backtraces collection by adding "". In case of crash - this also indicates which thread caused the crash. The handling thread in won't necessarily appear first. ``` ------ STACK TRACE ------ EIP: /lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc] 67089 redis-server linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb9437790] /lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc] redis-server :6379(+0x75e0c)[0xaaaac2fe5e0c] redis-server :6379(aeProcessEvents+0x18c)[0xaaaac2fe6c00] redis-server :6379(aeMain+0x24)[0xaaaac2fe7038] redis-server :6379(main+0xe0c)[0xaaaac3001afc] /lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xffffb91d73fc] /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffffb91d74cc] redis-server :6379(_start+0x30)[0xaaaac2fe0370] 67093 bio_lazy_free /lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc] /lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc] redis-server :6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8] /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8] /lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c] 67091 bio_close_file /lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc] /lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc] redis-server :6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8] /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8] /lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c] 67092 bio_aof /lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc] /lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc] redis-server :6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8] /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8] /lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c] 67089:signal-handler (1693824528) -------- ```	2023-09-24 09:47:23 +03:00
Chen Tianjie	e3d4b30d09	Add two stats to count client input and output buffer oom. (#12476 ) Add these INFO metrics: * client_query_buffer_limit_disconnections * client_output_buffer_limit_disconnections Sometimes it is useful to monitor whether clients reaches size limit of query buffer and output buffer, to decide whether we need to adjust the buffer size limit or reduce client query payload.	2023-08-30 21:51:14 +03:00
meiravgri	fe47c2027b	Signal handler attributes (#12426 ) This PR purpose is to make the crash report process thread safe. main changes include: 1. `setupSigSegvHandler()` is introduced to initialize the signal handler. This function first initializes the signal handler mutex (if not initialized yet) and then registers the process to the signal handler. 2. sigsegvHandler flags : SA_NODEFER - don't add the signal to the process signal mask. We use this flag because we want to be able to handle a second call to the signal manually. removed SA_RESETHAND: this flag resets the signal handler function upon the first entrance to the registered function. The reason to use this flag is to protect from recursively entering the signal handler by the same thread. But, it also means that if a second thread crashes while handling a signal, the process will be terminated immediately and we won't get the crash report. In this PR we discard this flag. The signal handler guard described below purpose is to solve the above issues. 3. Add a signal handler lock with ERRORCHECK attributes. The lock's purpose is to ensure that only one thread generates a crash report. Once a second thread enters the signal handler it will be blocked. We use the ERRORCHECK lock in order to protect from possible deadlock in case the thread handling the crash gets a signal. In the latest scenario, we log what we have collected until the handler crashed. At the end of the crash report we reset the signal handler SIG_DFL, with no flags, and rethrow the signal to generate a core dump (if enabled) and exit the process. During the work on this PR we wanted to understand the historical reasons for how crash is handled. With respect to the choice of the flag, we believe the SA_RESETHAND was not added for any specific purpose. SA_ONSTACK which is removed here from bugReportEnd(), was originally also set in the initial registration to signal handler, but removed in `3ada43e73`. In addition, it was removed from another location in `deee2c1ef` with the following description, which is also relevant to why it should be removed from bugReportEnd: > it seems to be some valgrind bug with SA_ONSTACK. > SA_ONSTACK seems unneeded since WD is not recursive (SA_NODEFER was removed), > also, not sure if it's even valid without a call to sigaltstack()	2023-08-20 19:16:45 +03:00
zhaozhao.zz	8226f39fb2	do not call handleClientsBlockedOnKeys inside yielding command (#12459 ) Fix the assertion when a busy script (timeout) signal ready keys (like LPUSH), and then an arbitrary client's `allow-busy` command steps into `handleClientsBlockedOnKeys` try wake up clients blocked on keys (like BLPOP). Reproduction process: 1. start a redis with aof `./redis-server --appendonly yes` 2. exec blpop `127.0.0.1:6379> blpop a 0` 3. use another client call a busy script and this script push the blocked key `127.0.0.1:6379> eval "redis.call('lpush','a','b') while(1) do end" 0` 4. user a new client call an allow-busy command like auth `127.0.0.1:6379> auth a` BTW, this issue also break the atomicity of script. This bug has been around for many years, the old versions only have the atomic problem, only 7.0/7.2 has the assertion problem. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-08-05 09:52:03 +03:00
zhaozhao.zz	01eb939a06	update monitor client's memory and evict correctly (#12420 ) A bug introduced in #11657 (7.2 RC1), causes client-eviction (#8687) and INFO to have inaccurate memory usage metrics of MONITOR clients. Because the type in `c->type` and the type in `getClientType()` are confusing (in the later, `CLIENT_TYPE_NORMAL` not `CLIENT_TYPE_SLAVE`), the comment we wrote in `updateClientMemUsageAndBucket` was wrong, and in fact that function didn't skip monitor clients. And since it doesn't skip monitor clients, it was wrong to delete the call for it from `replicationFeedMonitors` (it wasn't a NOP). That deletion could mean that the monitor client memory usage is not always up to date (updated less frequently, but still a candidate for client eviction).	2023-07-25 16:10:38 +03:00

1 2 3 4 5 ...

966 Commits