Bug fixes, some improvements.

net-000428 is the first batch, split off because of pretty big changes
in the TCP transmission algorithms.

The rest contains further work, not logged until the storm has spent itself.

Review
------

*General

1. skb used after free in dummy and whitehole devices. (BUG)
2. fib_sync_up() did not revive routes in time. (BUG, by Andi)
3. ECN support is complete at IP level (routing, tunnels, api). (NEW)
4. ECN support in TCP. (NEW, by Jamal)
5. Several bad IPv6 bugs (reassembly etc). (BUG)
6. pipe optimization. (NEW, by Dave)
7. af_unix locking bugs. (BUG)

*TCP receiver

TR1. Smarter ACK strategy with nagling sender. (TEST)
TR2. An attempt at autotuning the advertised TCP window. (TEST)
TR3. Watchdog timer on prequeue (combined with delack timer). (BUG)

*TCP sender

TS1. TCP ack path rectification. (NEW)
TS2. TCP fast retransmit and congestion avoidance. (NEW)

--------------------------------------------------------------------

net-000601

It is for 2.4.0-test1 (maybe, +ac). The VFS threading patch by Al Viro
is recommended as well.

*General
  - pipe optimization, present in previous patches, is temporarily
    removed; it could conflict with Al's patch.
  - af_unix bug fixes are moved to Al's patch and, hence, removed
    from net.

af_unix bug fixes are split out to a separate af_unix-2.4.0.dif.
It is obsoleted by Viro's patches.

--------------------------------------------------------------------

net-000607

*General
  - pipe optimization is back.
  - igmp timer races. (BUG, discovered by Andrew)

*TCP sender
  - Do forward retransmits only in recovery state. (BUG-IN-NEW)
  - Limit the number of retransmitted segments per ACK and do
    slow-start retransmit even for duplicate ACKs. (NEW)

*TCP receiver
  - Allow even tiny segments to open the advertised window. (NEW)

--------------------------------------------------------------------

net-000608

*TCP receiver
  - Collapsing receive queues instead of pruning. (NEW)
  - Shrinking window clamp, when rmem_alloc hits bound. (NEW)

  [ STATUS.
    This means that the problem of big skb overhead is finally solved. ]

--------------------------------------------------------------------

net-000609

*General
  - Update to base 2.4.0-test1-ac12

*TCP receiver
  - Even more aggressive receive queue collapse. (NEW)

--------------------------------------------------------------------

net-000612

*General
  - IPv4/IPv6 defragmenter.
    * SMP races are resolved.
    * Well, as a by-product it turned out to be easy to thread it.
    * By the way, the IPv4 defragmenter is edited. It was unreadable.

--------------------------------------------------------------------

net-000614

*General
  - igmp6 races. [BUG]
  - IPv4/IPv6 defragmenter: change oversize check.
    The old IPv4 check was doubly wrong: it sometimes did not detect
    oversize and sometimes killed valid packets because of wrong
    accounting of the length of IP options. [BUG]
  - I forgot to replace CONFIG_TCP_ECN with CONFIG_INET_ECN
    in one place. [BUG]

*TCP receiver
  - SACKs were not always removed in time. See tcp_remove_sacks() [BUG]

--------------------------------------------------------------------

net-000615

*General
  - Kill ip_done() in defragmenter; the check reduces to a plain
    comparison of target length and actually arrived length. Great!

*TCP receiver
  - D-SACK. [NEW]
  - Plus SACK code became cleaner. In particular, several (non-crucial)
    bugs are fixed. [BUG]

*TCP sender
  - ECE ACKs _do_ increment CWND! The ECN draft is pretty obscure
    and messy, but it is pretty clear that CWND is not incremented
    only in the congestion avoidance phase (normal in ECN). Slow start
    must not depend on ECE.
  - Some bits of D-SACK. Note that the receiver side _must_ be complete
    (it is now), but the sender side can be tuned gradually afterwards.
    D-SACK opens so many possibilities to detect and to recover
    from different network pathologies (reordering, duplication,
    ACK loss etc.), not addressed earlier, that it will take lots
    of time to use all this power.
    For now we do only reordering detection (done earlier)
    and account for duplicate retransmits in in_flight.

--------------------------------------------------------------------

net-000616

*TCP receiver
  - Some bugs in D-SACK are found.
  - Clamping for ATO (TCP_ATO_MAX) is removed. It has lost its meaning.

--------------------------------------------------------------------

net-000617

*General
  - IPv6: oops, binding address was not used by datagram socket! [BUG]
  - FIONREAD for raw and packet sockets. [BUG in fact]
  - IPv6 reassembly algorithm is borrowed from IPv4. The IPv6 one
    was buggy when fragments overlapped. [BUG]
  - SO_TIMESTAMP [NEW]

*TCP receiver
  - Small rcv_mss is allowed, when tinygrams arrive without PSH. [TEST]

    RATIONALE: PSH is set when a complete record is written out, right?
    Hence, PSHless segments are _always_ MSS sized, except for
    the fragments generated by SWS avoidance override.

*TCP sender
  - Do not clear SACK on timeout. Clear only the head of the queue
    to start retransmission. If the OFO queue was dropped, it will be
    detected soon by reneging detection. RFC allows this.
  - SACKED_RETRANS flag is cleared sometimes, hence TSless SACK TCP
    violates Karn's rule for RTT calculation.
    Add involatile EVER_RETRANS tag. [BUG]
  - Some D-SACK modifications to reordering detection.
    Probably wrong, in test. [TEST]

--------------------------------------------------------------------

net-000618

*General
  - Bug in icmp.c, it should not rely on skb->len. [BUG]
  - Defragmenter uses skb->cb. [NEW]

*TCP sender
  - Set PSH on frames sent due to SWS avoidance override.
    If the receiver does not want to open the window, we can help
    him a bit. [TEST]

--------------------------------------------------------------------

net-000627

[ In the gap some updates were made gradually, but they were
  not logged. Also, a partial merge to vger occurred. ]

*General
  - ECN is restructured.
  - Checksum update routines were broken. Old report
    by Arthur Skawina. [BUG]
  - Confusing "neighbor table overflow" is removed.
    By Werner. [BUG]
  - Update to ip-sysctl.txt and to config files.
  - CONFIG_SKB_LARGE is killed.
  - ip_fragment.c is aware of hardware checksumming.

*TCP sender
  - "rsync lockup" bug. The RFC window update algorithm _requires_
    support for a shrinking window, because RFC window updates
    may occur through artificial inflation and subsequent deflation
    of the window. It is a surprise. 8)
    A workaround is proposed to avoid this: namely, holding the
    right edge fixed when SND.UNA moves right but the update to
    SND.WND is rejected by the RFC rules. It should work even if
    we do not understand window shrinking, but I am still not
    100% sure.
  - Some minimal support for receivers shrinking the window. It is
    still not correct (times out, rather than waiting for the window
    to open, if the window is shrunk to zero), but should be mostly
    sane. At least, invalid segments are no longer continuously
    retransmitted, zero window probes are sent in time and legal
    sequence numbers are used for ACKing. So, "rsync lockup" may be
    avoided now even without the "holding window" trick.
  - D-SACK sender side is complete to the extent we are able to do
    now. Additional undo heuristics, based on D-SACK. Essentially,
    it means that reordering does not affect throughput at all,
    provided the receiver window is large enough.
  - One more undo heuristic, based on timestamps, is applied when
    a partial ACK arrives. The Hoe phase may be avoided now. This
    should help wireless folks, for whom the Hoe extension is a big pain.

*TCP generic
  - tcp_input.c became too large. An attempt at a split:
    tcp_minisocks.c contains timewait bucket and syn request code.
    It is still 100K...

--------------------------------------------------------------------

net-000702

*TCP sender
  - Window update algorithm stabilized to "the only correct one".
  - Window shrinking hacks are withdrawn as too ugly.

--------------------------------------------------------------------

net-000705

*General
  - Zero length read() on af_unix sockets. Restore 2.2 behaviour.

*TCP generic
  - Zero length read() should look like af_unix. [Old BUG]
  - IPv6 TCP input routine violated assumption on skb sharing. [BUG!]
  - More (much more) statistics on exceptional conditions.

*TCP sender
  - Send PSH at least once each window to keep buggy windows receivers
    happy. [BUG, in fact]
  - Move queued memory accounting inside TCP. It is both faster
    and allows us to play with sndbuf autotuning.
  - Be more liberal on initial cwnd: take ssthresh into account
    when it is known.

--------------------------------------------------------------------

net-000706

*TCP sender
  - Big win. sk->sndbuf autotuning in the spirit of PSC. No min-max
    balancing, but autoselection is still sane and allows working
    with arbitrary link powers without harming VM and security.
    First tests show it behaves sanely.

    [ Oh, lucky day! ~100Kbyte/sec to vger, no losses and nice, ugly
      reordering with metric up to 8! Guys, TCP did not make _ANY_
      mistakes in reordering detection. emacs-19.29 ftped five times
      almost without retransmissions and all the retransmissions
      are undone. I cannot believe this fortune. 8) ]

--------------------------------------------------------------------

net-000717

All the chunks (outside of TCP) are committed. But new bugs have
appeared since that time:

  - Bug in icmp.c, IP header length was added twice. (Rusty).
  - tcp_tw_recycle kills masqueraded clients, turn it off. (Andi).

So, the shop is _closed_.

The last makeup is cleaning TCP timers to move timer logic under
the socket lock. Namely:

  - timers' state and timeout values are kept separately, outside
    the timer struct, and changed only under the socket lock, so that
    we always know their real state.
  - probe_timer is killed, because it is not required after this
    cleanup. [ One day, when we do timed transmissions, it will
    return in some different form. ]
  - Real timers are never cleared. It was a useless loss of time both
    on unidirectional and transaction-like connections. The only case
    when the new scheme is worse is an almost idle keepalive-like
    session.
    Well, if it is almost idle, the timer will not add significant
    overhead in any case.

Yes. It turned out that congestion window validation is tightly
connected to buffer autotuning. Essentially this connection shows
a hole both in RFC2861 and in PSC autotuning: network starvation can
mean both a too small sndbuf and a real application limit. The solution
becomes natural when we put both algorithms together: sndbuf is
expanded when we have recently seen throttling by the memory limit.
So, RFC2861 is implemented as well; it is the cheapest way
to resolve the conflict.

--------------------------------------------------------------------

net-000729

*General
  - bug in proxy neighbour table cleanup. (Marc Boucher) [BUG]
    I hope it will not be lost this time. 8)
  - bugs in tunnels introduced with netfilter hooks. [BUG]
    Do not forget to remind Rusty again that he should be
    more accurate.

    Sheeit! Something inside netfilter is _very_ broken, it mangles
    or drops packets even if no rules are configured. It is enough
    to compile it in to break networking. Multiple reports.

*TCP receiver
  - Increase default rcvbuf and window scale calculation
    to advertise the maximal window by default.

--------------------------------------------------------------------

net-000731

*General
  - Some fixes from Rasmus Andersen (compiler warnings)

*TCP
  - Some cleanups in TCP sysctl, tcp_*mem are grouped into triples.
    Added tcp_app_win and tcp_adv_win_scale instead of hardwired
    definitions. ip-sysctl.txt is updated.
  - Final cleanup of tcp_mem_schedule.
  - Growing rcvbuf on overcommit due to ofo segments. Removed
    2*rcvbuf hack, it should not be necessary with the new tcp
    rcvbuf manager.

--------------------------------------------------------------------

net-000801

*TCP sender
  - Do not apply Floyd block to SACK TCP, it is defended by SACKs.
  - Funny thing, noticed by Andrey Gurtov. Linux-2.2/4 used to burst
    after slow start retransmit. Khm... I am not sure that it is
    worth fixing, but it looked really disgusting.

--------------------------------------------------------------------

net-000803

*TCP
  - Alan's schedule_timeo(1) hack. Whew...

*TCP sender
  - Do slow start after loss, like BSD teaches. The problem noticed
    by Andrey Gurtov (see previous snapshot) disappears.

--------------------------------------------------------------------

net-000804

*TCP receiver
  - calculate initial rcvmss more accurately, it is possible.

*TCP sender
  - Yesterday's improvement makes undoing false loss trivial.

    Damn, I invented the bicycle again. Undoing retransmits using
    timestamps, which I supposed to be something original, turned out
    to be well known under the name of "Eifel algorithm".

--------------------------------------------------------------------

net-000805

*General
  - BUG in neighbour.c (sdoyon@vipswitch.COM)
  - BUG in binding IPv6 sockets.
    yoshfuji@v6.linux.or.JP (Hideaki YOSHIFUJI)

*TCP
  - Accept reset with text on zero window.

    [ NOTE. tcp_sequence() is totally broken and will require
      a rewrite sooner or later. Well, it is not so important, so it
      is better to do it later rather than sooner. The RST case is
      really important. ]

*TCP sender
  - Text edit (semantic one 8)) in tcp_fastretrans_alert()
    to make it cleaner.

--------------------------------------------------------------------

net-000808

*General
  - AF_UNIX, SOCK_DGRAM. Make poll() on write work at least
    on connected sockets. [BUG in fact]
  - Improved packet socket. Now it may combine ring operation
    and user copy. Seems, this is _finally_ the optimal solution,
    only zero copy is more clever. 8)
  - /proc/net/tcp etc. did illegal references to sk->socket. [BUG!]
    Do not forget to fix this for other PF_*!

*TCP
  - Check th->doff for validity!!! Dave fixed this in 2.2, but
    2.3 remained buggy. [BUG!]
  - Edit continued. Seems, now it can be sold without shame. 8)

--------------------------------------------------------------------

net-000809

*** net-000808 has been merged to vger. This patch is incremental.

  - last moment fixes.
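
[ NOTE on the th->doff item of net-000808. The log does not say what the
  check looks like. The sketch below is only an illustration of what
  such a validity test usually amounts to: the data offset counts 32-bit
  words, so it must be at least 5 (the fixed 20-byte header) and must
  not claim more bytes than the segment actually holds. The helper name
  tcp_doff_is_valid() and its arguments are hypothetical, not taken from
  the patch. ]

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the kernel's code.
 * doff    - TCP data offset field, in 32-bit words
 * seg_len - bytes actually present from the start of the TCP header
 * Returns 1 if the header length is sane for this segment, else 0. */
static int tcp_doff_is_valid(uint8_t doff, size_t seg_len)
{
    size_t hdr_len = (size_t)doff * 4;

    if (doff < 5)           /* shorter than the fixed 20-byte header */
        return 0;
    if (hdr_len > seg_len)  /* header claims more bytes than arrived */
        return 0;
    return 1;
}
```

A segment failing either test would be dropped before any option parsing
touches bytes past the end of the data.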

--------------------------------------------------------------------

net-000810

*** net-000809 has been merged to vger too. This patch is incremental
    again.

  - GRED patch arrived from Jamal.
  - disable debugging messages in TCP.
  - removed rcv_small/rcv_thresh heuristics.

--------------------------------------------------------------------

BUGS WAITING FOR A FIX.

#1. IPv6 loses all-nodes multicast address at interface down.
#2. Timer races in IPv6/IPv4 defragmenter. [repaired 000612, me]
#3. Lots of races in igmp6. [repaired 000614, me]
#4. CBQ frees something twice. 2.2 is buggy too.
#5. All the timers in net/sched/sch_*.c are racy with SMP.

Alexey Kuznetsov
kuznet@ms2.inr.ac.ru