Archive for the 'Internet' Category

Macro architecture: design at the system level.

Pattern-Oriented Software Architecture (《面向模式的软件体系结构》), five volumes. A comprehensive compendium, if somewhat dated; excellent for anyone trying to build an architectural mindset. It was already out of print when I read it, so I bought a photocopy on Taobao; the translation is poor, but even reading between the lines I learned a lot. The first three volumes are currently available on Amazon.

Beautiful Architecture: several real architecture case studies, from enterprise systems to the internet, covering requirements analysis, technology selection, trade-offs and architectural design. The way it reasons and presents is worth borrowing.

The Process of Software Architecting (《架构实战 – 软件架构设计的过程》): not heavy on technology; instead it walks through the whole architecture process with examples — how to start, which roles participate at each stage, what the deliverables are, how to validate them, and so on. For someone just starting out as an architect, it is at least a map to follow so nothing gets dropped.

《思考软件,创新设计 – A端架构师的思考技术》 ("Thinking About Software, Designing with Innovation"), by 高焕堂. When I was brooding over the value of a business architect and how to become truly good at it, this was a small but reasonably bright lamp. An easy, light read; worth a try.

Micro architecture: design at the code level.

Refactoring: Improving the Design of Existing Code, by Martin Fowler — a must-read. This book made me fall in love with refactoring, taught me what beautiful code looks like, and let me improve through continuously polishing code.

Design Patterns: a classic you should be embarrassed not to have read two or three times. First, it names the small techniques we already use in everyday coding; once that abstraction is in place, you can compare implementation approaches in your head and pick a pattern. It also eases communication — say "producer-consumer" and everyone knows what you mean.

Clean Code: another book about what good code is.

Code Complete: widely praised, and enormous. Honestly I have never read it cover to cover.

Whether you are an architect or an engineer, code is your bread and butter, so I recommend reading the books above carefully and repeatedly, and internalizing them while coding.

Other recommendations

Besides technology, an architect also has to cover process, testability, people, resource coordination and more, so here are a few additional books.

The Mythical Man-Month: a classic; worth reading whether or not it all sinks in.

JUnit in Action, 《渗透测试实践指南》 (a hands-on penetration-testing guide), and the like. I have not finished them; the point is that an architect must understand and care about testing.

The Art of UNIX Programming: you may never write C, but the UNIX design philosophy and its insistence on simplicity and elegance are absolutely worth learning — and the book is genuinely funny.

The Pyramid Principle: as an architect you will write plenty of slides and documents and talk to many people; this book taught me to speak plainly.

Finally, a book to get the blood pumping, so we can march on while remembering the hackers' golden age:

Hackers (Steven Levy): the first half is the golden era of hardware and free software; I could not get through the second half.

Lastly, I believe a good architect, beyond talking well, must also know their own domain very deeply. So besides the architecture books recommended here, please keep digging into the technology you are already good at.

 

Yesterday I upgraded WordPress to a new version, and the Developer Formatter plugin started failing when inserting code. The first error message was:

Uncaught TypeError: undefined is not a function

Tracing it with Firebug, the error comes from the JavaScript generated by wp-content/plugins/devformatter/devinterface.php: the call to execInstanceCommand fails (most likely because the TinyMCE 4 bundled with newer WordPress no longer provides that method).

The fix is as follows:

# vim wp-content/plugins/devformatter/devinterface.php — find the line that calls execInstanceCommand and change it to:

      if(HtmlEditor){
        edInsertContent(edCanvas, DevFmt_ContentStart + DevFmt_TheContent + DevFmt_ContentEnd);
      }else{
        alert(DevFmt_ContentStart + DevFmt_TheContent + DevFmt_ContentEnd);
        tinyMCE.execCommand('mceReplaceContent', false,
          switchEditors.wpautop(DevFmt_ContentStart + DevFmt_TheContent + DevFmt_ContentEnd));
      }

The second problem: after inserting code that contains spaces, the page shows many stray DVFMTSC markers. Modify wp-content/plugins/devformatter/devfmt_editor.js as follows:

Change

block = block.replace(/{{DVFMTSC}}/gi, '<!--DVFMTSC-->&').replace(/\n/gi, "<br />");

to:

block = block.replace(/{{DVFMTSC}}/gi, '&').replace(/\n/gi, "<br />");

References:

  • http://stackoverflow.com/questions/22813970/typeerror-window-tinymce-execinstancecommand-is-not-a-function

The sar command

I like using this command to inspect and monitor the overall state of a system. The following is quoted from http://linux.die.net/man/1/sar, with my own annotations added.

Common parameter combinations

Memory:

free, sar -r 1, sar -B 1, ps aux

CPU and load:

sar -u ALL 1, sar -P ALL 1, sar -q 1, sar -w 1

Disk:

sar -b 1, sar -d 1, sar -v 1

Network:

sar -n DEV, etc.

Man page, with annotations

Name

sar – Collect, report, or save system activity information.

Synopsis

sar [ -A ] [ -b ] [ -B ] [ -C ] [ -d ] [ -h ] [ -i interval ] [ -m ] [ -p ] [ -q ] [ -r ] [ -R ] [ -S ] [ -t ] [ -u [ ALL ] ] [ -v ] [ -V ] [ -w ] [ -W ] [ -y ] [ -n { keyword [,…] | ALL } ] [ -I { int[,…] | SUM | ALL | XALL } ] [ -P { cpu [,…] | ALL } ] [ -o [ filename ] | -f [ filename ] ] [ -s [ hh:mm:ss ] ] [ -e [hh:mm:ss ] ] [ interval [ count ] ]

Description

The sar command writes to standard output the contents of selected cumulative activity counters in the operating system. The accounting system, based on the values in the count and interval parameters, writes information the specified number of times spaced at the specified intervals in seconds. If the interval parameter is set to zero, the sar command displays the average statistics for the time since the system was started. If the interval parameter is specified without the count parameter, then reports are generated continuously. The collected data can also be saved in the file specified by the -o filename flag, in addition to being displayed onto the screen. If filename is omitted, sar uses the standard system activity daily data file, the /var/log/sa/sadd file, where the dd parameter indicates the current day. By default all the data available from the kernel are saved in the data file.

The sar command extracts and writes to standard output records previously saved in a file. This file can be either the one specified by the -f flag or, by default, the standard system activity daily data file.

Without the -P flag, the sar command reports system-wide (global among all processors) statistics, which are calculated as averages for values expressed as percentages, and as sums otherwise. If the -P flag is given, the sar command reports activity which relates to the specified processor or processors. If -P ALL is given, the sar command reports statistics for each individual processor and global statistics among all processors. In other words: without -P you get the system-wide view, with -P CPU-NUM you see a specific processor, and with -P ALL you get every CPU plus the global summary. From an application's point of view, this tells you whether a program is spreading its work evenly across the CPUs.

You can select information about specific system activities using flags. Not specifying any flags selects only CPU activity. Specifying the -A flag is equivalent to specifying -bBdqrRSvwWy -I SUM -I XALL -n ALL -u ALL -P ALL.

The default version of the sar command (CPU utilization report) might be one of the first facilities the user runs to begin system activity investigation, because it monitors major system resources. If CPU utilization is near 100 percent (user + nice + system), the workload sampled is CPU-bound.

If multiple samples and multiple reports are desired, it is convenient to specify an output file for the sar command. Run the sar command as a background process. The syntax for this is:

sar -o datafile interval count >/dev/null 2>&1 &

All data is captured in binary form and saved to a file (datafile). The data can then be selectively displayed with the sar command using the -f option. Set the interval and count parameters to select count records at interval second intervals. If the count parameter is not set, all the records saved in the file will be selected. Collection of data in this manner is useful to characterize system usage over a period of time and determine peak usage hours.

Note: The sar command only reports on local activities.

Options

-A

This is equivalent to specifying -bBdqrRSuvwWy -I SUM -I XALL -n ALL -u ALL -P ALL.

-b

Report I/O and transfer rate statistics. The following values are displayed: these are I/O metrics for physical storage devices.

tps

Total number of transfers per second that were issued to physical devices. A transfer is an I/O request to a physical device. Multiple logical requests can be combined into a single I/O request to the device. A transfer is of indeterminate size.

Read and write requests issued to physical storage per second; broken down below into read requests (rtps) and write requests (wtps).

rtps

Total number of read requests per second issued to physical devices.

wtps

Total number of write requests per second issued to physical devices.

bread/s

Total amount of data read from the devices in blocks per second. Blocks are equivalent to sectors with 2.4 kernels and newer and therefore have a size of 512 bytes. With older kernels, a block is of indeterminate size.

Blocks read per second. On 2.4 and newer kernels, block size = sector size = 512 bytes. The written-blocks counterpart (bwrtn/s) is below.

bwrtn/s

Total amount of data written to devices in blocks per second.

-B

Report paging statistics. Some of the metrics below are available only with post 2.5 kernels. The following values are displayed: paging-related data.

pgpgin/s

Total number of kilobytes the system paged in from disk per second. Note: With old kernels (2.2.x) this value is a number of blocks per second (and not kilobytes).

On kernels newer than 2.2, the amount of data paged in from disk per second, in KB.

pgpgout/s

Total number of kilobytes the system paged out to disk per second. Note: With old kernels (2.2.x) this value is a number of blocks per second (and not kilobytes).

On kernels newer than 2.2, the amount of data paged out to disk per second, in KB.

fault/s

Number of page faults (major + minor) made by the system per second. This is not a count of page faults that generate I/O, because some page faults can be resolved without I/O.

Page faults generated by the system per second. Note that a page fault does not necessarily cause I/O; only the major faults below always load data from disk into memory.

majflt/s

Number of major faults the system has made per second, those which have required loading a memory page from disk.

Major faults per second; each of these loads a memory page from disk.

pgfree/s

Number of pages placed on the free list by the system per second.

Memory pages placed on the free list by the system per second.

pgscank/s

Number of pages scanned by the kswapd daemon per second.

Pages scanned by the kswapd daemon per second.

pgscand/s

Number of pages scanned directly per second.

Pages scanned directly (direct reclaim) per second.

pgsteal/s

Number of pages the system has reclaimed from cache (pagecache and swapcache) per second to satisfy its memory demands.

Pages per second reclaimed from cache (page cache and swap cache) to satisfy the system's memory demands.

%vmeff

Calculated as pgsteal / pgscan, this is a metric of the efficiency of page reclaim. If it is near 100% then almost every page coming off the tail of the inactive list is being reaped. If it gets too low (e.g. less than 30%) then the virtual memory is having some difficulty. This field is displayed as zero if no pages have been scanned during the interval of time.

-C

When reading data from a file, tell sar to display comments that have been inserted by sadc.

-d

Report activity for each block device (kernels 2.4 and newer only). When data is displayed, the device specification dev m-n is generally used (DEV column). m is the major number of the device. With recent kernels (post 2.5), n is the minor number of the device, but is only a sequence number with pre 2.5 kernels. Device names may also be pretty-printed if option -p is used (see below). Values for fields avgqu-sz, await, svctm and %util may be unavailable and displayed as 0.00 with some 2.4 kernels. Note that disk activity depends on sadc options “-S DISK” and “-S XDISK” to be collected. The following values are displayed:

tps

Indicate the number of transfers per second that were issued to the device. Multiple logical requests can be combined into a single I/O request to the device. A transfer is of indeterminate size.

Same meaning as tps under the -b option.

rd_sec/s

Number of sectors read from the device. The size of a sector is 512 bytes.

Same meaning as bread/s under the -b option.

wr_sec/s

Number of sectors written to the device. The size of a sector is 512 bytes.

Same meaning as bwrtn/s under the -b option.

avgrq-sz

The average size (in sectors) of the requests that were issued to the device.

Average amount of data transferred per request, in sectors (sector size = 512 bytes).

avgqu-sz

The average queue length of the requests that were issued to the device.

Average length of the request queue. If the disk or other hardware is failing, this queue length is very likely to spike.

await

The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.

Average time to handle a request, in ms, including both the time spent waiting in the queue and the actual service time.

svctm

The average service time (in milliseconds) for I/O requests that were issued to the device.

Average actual service time per request, in ms; await minus svctm gives the queueing time. On our HBase servers, await - svctm reaches roughly 3 ms, while on lightly loaded web servers it is only a few hundredths of a millisecond.

%util

Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.

Share of CPU time during which I/O requests were outstanding to the device; the closer to 100%, the more saturated the device.

-e [ hh:mm:ss ]

Set the ending time of the report. The default ending time is 18:00:00. Hours must be given in 24-hour format. This option can be used when data are read from or written to a file (options -f or -o ).

-f [ filename ]

Extract records from filename (created by the -o filename flag). The default value of the filename parameter is the current daily data file, the /var/log/sa/sadd file. The -f option is exclusive of the -o option.

-h

Display a short help message then exit.

-i interval

Select data records at seconds as close as possible to the number specified by the interval parameter.

-I { int [,…] | SUM | ALL | XALL }

Report statistics for a given interrupt. int is the interrupt number. Specifying multiple -I int parameters on the command line will look at multiple independent interrupts. The SUM keyword indicates that the total number of interrupts received per second is to be displayed. The ALL keyword indicates that statistics from the first 16 interrupts are to be reported, whereas the XALL keyword indicates that statistics from all interrupts, including potential APIC interrupt sources, are to be reported. Note that interrupt statistics depend on sadc option “-S INT” to be collected.

-m

Report power management statistics. Note that these statistics depend on sadc option “-S POWER” to be collected. The following value is displayed:

MHz

CPU clock frequency in MHz.

-n { keyword [,…] | ALL }

Report network statistics.

Possible keywords are DEV, EDEV, NFS, NFSD, SOCK, IP, EIP, ICMP, EICMP, TCP, ETCP, UDP, SOCK6, IP6, EIP6, ICMP6, EICMP6 and UDP6.

With the DEV keyword, statistics from the network devices are reported. The following values are displayed:

Network monitoring; simple usage: sar -n DEV 1 1.

IFACE

Name of the network interface for which statistics are reported.

rxpck/s

Total number of packets received per second.

txpck/s

Total number of packets transmitted per second.

rxkB/s

Total number of kilobytes received per second.

txkB/s

Total number of kilobytes transmitted per second.

rxcmp/s

Number of compressed packets received per second (for cslip etc.).

txcmp/s

Number of compressed packets transmitted per second.

rxmcst/s

Number of multicast packets received per second.

With the EDEV keyword, statistics on failures (errors) from the network devices are reported. The following values are displayed:

IFACE

Name of the network interface for which statistics are reported.

rxerr/s

Total number of bad packets received per second.

txerr/s

Total number of errors that happened per second while transmitting packets.

coll/s

Number of collisions that happened per second while transmitting packets.

rxdrop/s

Number of received packets dropped per second because of a lack of space in linux buffers.

txdrop/s

Number of transmitted packets dropped per second because of a lack of space in linux buffers.

txcarr/s

Number of carrier-errors that happened per second while transmitting packets.

rxfram/s

Number of frame alignment errors that happened per second on received packets.

rxfifo/s

Number of FIFO overrun errors that happened per second on received packets.

txfifo/s

Number of FIFO overrun errors that happened per second on transmitted packets.

With the NFS keyword, statistics about NFS client activity are reported. The following values are displayed:

call/s

Number of RPC requests made per second.

retrans/s

Number of RPC requests per second, those which needed to be retransmitted (for example because of a server timeout).

read/s

Number of ‘read’ RPC calls made per second.

write/s

Number of ‘write’ RPC calls made per second.

access/s

Number of ‘access’ RPC calls made per second.

getatt/s

Number of ‘getattr’ RPC calls made per second.

With the NFSD keyword, statistics about NFS server activity are reported. The following values are displayed:

scall/s

Number of RPC requests received per second.

badcall/s

Number of bad RPC requests received per second, those whose processing generated an error.

packet/s

Number of network packets received per second.

udp/s

Number of UDP packets received per second.

tcp/s

Number of TCP packets received per second.

hit/s

Number of reply cache hits per second.

miss/s

Number of reply cache misses per second.

sread/s

Number of ‘read’ RPC calls received per second.

swrite/s

Number of ‘write’ RPC calls received per second.

saccess/s

Number of ‘access’ RPC calls received per second.

sgetatt/s

Number of ‘getattr’ RPC calls received per second.

With the SOCK keyword, statistics on sockets in use are reported (IPv4). The following values are displayed:

totsck

Total number of sockets used by the system.

tcpsck

Number of TCP sockets currently in use.

udpsck

Number of UDP sockets currently in use.

rawsck

Number of RAW sockets currently in use.

ip-frag

Number of IP fragments currently in use.

tcp-tw

Number of TCP sockets in TIME_WAIT state.

With the IP keyword, statistics about IPv4 network traffic are reported. Note that IPv4 statistics depend on sadc option “-S SNMP” to be collected. The following values are displayed (formal SNMP names between square brackets):

irec/s

The total number of input datagrams received from interfaces per second, including those received in error [ipInReceives].

fwddgm/s

The number of input datagrams per second, for which this entity was not their final IP destination, as a result of which an attempt was made to find a route to forward them to that final destination [ipForwDatagrams].

idel/s

The total number of input datagrams successfully delivered per second to IP user-protocols (including ICMP) [ipInDelivers].

orq/s

The total number of IP datagrams which local IP user-protocols (including ICMP) supplied per second to IP in requests for transmission [ipOutRequests]. Note that this counter does not include any datagrams counted in fwddgm/s.

asmrq/s

The number of IP fragments received per second which needed to be reassembled at this entity [ipReasmReqds].

asmok/s

The number of IP datagrams successfully re-assembled per second [ipReasmOKs].

fragok/s

The number of IP datagrams that have been successfully fragmented at this entity per second [ipFragOKs].

fragcrt/s

The number of IP datagram fragments that have been generated per second as a result of fragmentation at this entity [ipFragCreates].

With the EIP keyword, statistics about IPv4 network errors are reported. Note that IPv4 statistics depend on sadc option “-S SNMP” to be collected. The following values are displayed (formal SNMP names between square brackets):

ihdrerr/s

The number of input datagrams discarded per second due to errors in their IP headers, including bad checksums, version number mismatch, other format errors, time-to-live exceeded, errors discovered in processing their IP options, etc. [ipInHdrErrors]

iadrerr/s

The number of input datagrams discarded per second because the IP address in their IP header’s destination field was not a valid address to be received at this entity. This count includes invalid addresses (e.g., 0.0.0.0) and addresses of unsupported Classes (e.g., Class E). For entities which are not IP routers and therefore do not forward datagrams, this counter includes datagrams discarded because the destination address was not a local address [ipInAddrErrors].

iukwnpr/s

The number of locally-addressed datagrams received successfully but discarded per second because of an unknown or unsupported protocol [ipInUnknownProtos].

idisc/s

The number of input IP datagrams per second for which no problems were encountered to prevent their continued processing, but which were discarded (e.g., for lack of buffer space) [ipInDiscards]. Note that this counter does not include any datagrams discarded while awaiting re-assembly.

odisc/s

The number of output IP datagrams per second for which no problem was encountered to prevent their transmission to their destination, but which were discarded (e.g., for lack of buffer space) [ipOutDiscards]. Note that this counter would include datagrams counted in fwddgm/s if any such packets met this (discretionary) discard criterion.

onort/s

The number of IP datagrams discarded per second because no route could be found to transmit them to their destination [ipOutNoRoutes]. Note that this counter includes any packets counted in fwddgm/s which meet this ‘no-route’ criterion. Note that this includes any datagrams which a host cannot route because all of its default routers are down.

asmf/s

The number of failures detected per second by the IP re-assembly algorithm (for whatever reason: timed out, errors, etc) [ipReasmFails]. Note that this is not necessarily a count of discarded IP fragments since some algorithms can lose track of the number of fragments by combining them as they are received.

fragf/s

The number of IP datagrams that have been discarded per second because they needed to be fragmented at this entity but could not be, e.g., because their Don’t Fragment flag was set [ipFragFails].

With the ICMP keyword, statistics about ICMPv4 network traffic are reported. Note that ICMPv4 statistics depend on sadc option “-S SNMP” to be collected. The following values are displayed (formal SNMP names between square brackets):

imsg/s

The total number of ICMP messages which the entity received per second [icmpInMsgs]. Note that this counter includes all those counted by ierr/s.

omsg/s

The total number of ICMP messages which this entity attempted to send per second [icmpOutMsgs]. Note that this counter includes all those counted by oerr/s.

iech/s

The number of ICMP Echo (request) messages received per second [icmpInEchos].

iechr/s

The number of ICMP Echo Reply messages received per second [icmpInEchoReps].

oech/s

The number of ICMP Echo (request) messages sent per second [icmpOutEchos].

oechr/s

The number of ICMP Echo Reply messages sent per second [icmpOutEchoReps].

itm/s

The number of ICMP Timestamp (request) messages received per second [icmpInTimestamps].

itmr/s

The number of ICMP Timestamp Reply messages received per second [icmpInTimestampReps].

otm/s

The number of ICMP Timestamp (request) messages sent per second [icmpOutTimestamps].

otmr/s

The number of ICMP Timestamp Reply messages sent per second [icmpOutTimestampReps].

iadrmk/s

The number of ICMP Address Mask Request messages received per second [icmpInAddrMasks].

iadrmkr/s

The number of ICMP Address Mask Reply messages received per second [icmpInAddrMaskReps].

oadrmk/s

The number of ICMP Address Mask Request messages sent per second [icmpOutAddrMasks].

oadrmkr/s

The number of ICMP Address Mask Reply messages sent per second [icmpOutAddrMaskReps].

With the EICMP keyword, statistics about ICMPv4 error messages are reported. Note that ICMPv4 statistics depend on sadc option “-S SNMP” to be collected. The following values are displayed (formal SNMP names between square brackets):

ierr/s

The number of ICMP messages per second which the entity received but determined as having ICMP-specific errors (bad ICMP checksums, bad length, etc.) [icmpInErrors].

oerr/s

The number of ICMP messages per second which this entity did not send due to problems discovered within ICMP such as a lack of buffers [icmpOutErrors].

idstunr/s

The number of ICMP Destination Unreachable messages received per second [icmpInDestUnreachs].

odstunr/s

The number of ICMP Destination Unreachable messages sent per second [icmpOutDestUnreachs].

itmex/s

The number of ICMP Time Exceeded messages received per second [icmpInTimeExcds].

otmex/s

The number of ICMP Time Exceeded messages sent per second [icmpOutTimeExcds].

iparmpb/s

The number of ICMP Parameter Problem messages received per second [icmpInParmProbs].

oparmpb/s

The number of ICMP Parameter Problem messages sent per second [icmpOutParmProbs].

isrcq/s

The number of ICMP Source Quench messages received per second [icmpInSrcQuenchs].

osrcq/s

The number of ICMP Source Quench messages sent per second [icmpOutSrcQuenchs].

iredir/s

The number of ICMP Redirect messages received per second [icmpInRedirects].

oredir/s

The number of ICMP Redirect messages sent per second [icmpOutRedirects].

With the TCP keyword, statistics about TCPv4 network traffic are reported. Note that TCPv4 statistics depend on sadc option “-S SNMP” to be collected. The following values are displayed (formal SNMP names between square brackets):

active/s

The number of times TCP connections have made a direct transition to the SYN-SENT state from the CLOSED state per second [tcpActiveOpens].

passive/s

The number of times TCP connections have made a direct transition to the SYN-RCVD state from the LISTEN state per second [tcpPassiveOpens].

iseg/s

The total number of segments received per second, including those received in error [tcpInSegs]. This count includes segments received on currently established connections.

oseg/s

The total number of segments sent per second, including those on current connections but excluding those containing only retransmitted octets [tcpOutSegs].

With the ETCP keyword, statistics about TCPv4 network errors are reported. Note that TCPv4 statistics depend on sadc option “-S SNMP” to be collected. The following values are displayed (formal SNMP names between square brackets):

atmptf/s

The number of times per second TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times per second TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state [tcpAttemptFails].

estres/s

The number of times per second TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state [tcpEstabResets].

retrans/s

The total number of segments retransmitted per second – that is, the number of TCP segments transmitted containing one or more previously transmitted octets [tcpRetransSegs].

isegerr/s

The total number of segments received in error (e.g., bad TCP checksums) per second [tcpInErrs].

orsts/s

The number of TCP segments sent per second containing the RST flag [tcpOutRsts].

With the UDP keyword, statistics about UDPv4 network traffic are reported. Note that UDPv4 statistics depend on sadc option “-S SNMP” to be collected. The following values are displayed (formal SNMP names between square brackets):

idgm/s

The total number of UDP datagrams delivered per second to UDP users [udpInDatagrams].

odgm/s

The total number of UDP datagrams sent per second from this entity [udpOutDatagrams].

noport/s

The total number of received UDP datagrams per second for which there was no application at the destination port [udpNoPorts].

idgmerr/s

The number of received UDP datagrams per second that could not be delivered for reasons other than the lack of an application at the destination port [udpInErrors].

With the SOCK6 keyword, statistics on sockets in use are reported (IPv6). Note that IPv6 statistics depend on sadc option “-S IPV6″ to be collected. The following values are displayed:

tcp6sck

Number of TCPv6 sockets currently in use.

udp6sck

Number of UDPv6 sockets currently in use.

raw6sck

Number of RAWv6 sockets currently in use.

ip6-frag

Number of IPv6 fragments currently in use.

With the IP6 keyword, statistics about IPv6 network traffic are reported. Note that IPv6 statistics depend on sadc option “-S IPV6″ to be collected. The following values are displayed (formal SNMP names between square brackets):

irec6/s

The total number of input datagrams received from interfaces per second, including those received in error [ipv6IfStatsInReceives].

fwddgm6/s

The number of output datagrams per second which this entity received and forwarded to their final destinations [ipv6IfStatsOutForwDatagrams].

idel6/s

The total number of datagrams successfully delivered per second to IPv6 user-protocols (including ICMP) [ipv6IfStatsInDelivers].

orq6/s

The total number of IPv6 datagrams which local IPv6 user-protocols (including ICMP) supplied per second to IPv6 in requests for transmission [ipv6IfStatsOutRequests]. Note that this counter does not include any datagrams counted in fwddgm6/s.

asmrq6/s

The number of IPv6 fragments received per second which needed to be reassembled at this interface [ipv6IfStatsReasmReqds].

asmok6/s

The number of IPv6 datagrams successfully reassembled per second [ipv6IfStatsReasmOKs].

imcpck6/s

The number of multicast packets received per second by the interface [ipv6IfStatsInMcastPkts].

omcpck6/s

The number of multicast packets transmitted per second by the interface [ipv6IfStatsOutMcastPkts].

fragok6/s

The number of IPv6 datagrams that have been successfully fragmented at this output interface per second [ipv6IfStatsOutFragOKs].

fragcr6/s

The number of output datagram fragments that have been generated per second as a result of fragmentation at this output interface [ipv6IfStatsOutFragCreates].

With the EIP6 keyword, statistics about IPv6 network errors are reported. Note that IPv6 statistics depend on sadc option “-S IPV6″ to be collected. The following values are displayed (formal SNMP names between square brackets):

ihdrer6/s

The number of input datagrams discarded per second due to errors in their IPv6 headers, including version number mismatch, other format errors, hop count exceeded, errors discovered in processing their IPv6 options, etc. [ipv6IfStatsInHdrErrors]

iadrer6/s

The number of input datagrams discarded per second because the IPv6 address in their IPv6 header’s destination field was not a valid address to be received at this entity. This count includes invalid addresses (e.g., ::0) and unsupported addresses (e.g., addresses with unallocated prefixes). For entities which are not IPv6 routers and therefore do not forward datagrams, this counter includes datagrams discarded because the destination address was not a local address [ipv6IfStatsInAddrErrors].

iukwnp6/s

The number of locally-addressed datagrams received successfully but discarded per second because of an unknown or unsupported protocol [ipv6IfStatsInUnknownProtos].

i2big6/s

The number of input datagrams that could not be forwarded per second because their size exceeded the link MTU of outgoing interface [ipv6IfStatsInTooBigErrors].

idisc6/s

The number of input IPv6 datagrams per second for which no problems were encountered to prevent their continued processing, but which were discarded (e.g., for lack of buffer space) [ipv6IfStatsInDiscards]. Note that this counter does not include any datagrams discarded while awaiting re-assembly.

odisc6/s

The number of output IPv6 datagrams per second for which no problem was encountered to prevent their transmission to their destination, but which were discarded (e.g., for lack of buffer space) [ipv6IfStatsOutDiscards]. Note that this counter would include datagrams counted in fwddgm6/s if any such packets met this (discretionary) discard criterion.

inort6/s

The number of input datagrams discarded per second because no route could be found to transmit them to their destination [ipv6IfStatsInNoRoutes].

onort6/s

The number of locally generated IP datagrams discarded per second because no route could be found to transmit them to their destination [unknown formal SNMP name].

asmf6/s

The number of failures detected per second by the IPv6 re-assembly algorithm (for whatever reason: timed out, errors, etc.) [ipv6IfStatsReasmFails]. Note that this is not necessarily a count of discarded IPv6 fragments since some algorithms can lose track of the number of fragments by combining them as they are received.

fragf6/s

The number of IPv6 datagrams that have been discarded per second because they needed to be fragmented at this output interface but could not be [ipv6IfStatsOutFragFails].

itrpck6/s

The number of input datagrams discarded per second because datagram frame didn’t carry enough data [ipv6IfStatsInTruncatedPkts].

With the ICMP6 keyword, statistics about ICMPv6 network traffic are reported. Note that ICMPv6 statistics depend on sadc option “-S IPV6″ to be collected. The following values are displayed (formal SNMP names between square brackets):

imsg6/s

The total number of ICMP messages received by the interface per second which includes all those counted by ierr6/s [ipv6IfIcmpInMsgs].

omsg6/s

The total number of ICMP messages which this interface attempted to send per second [ipv6IfIcmpOutMsgs].

iech6/s

The number of ICMP Echo (request) messages received by the interface per second [ipv6IfIcmpInEchos].

iechr6/s

The number of ICMP Echo Reply messages received by the interface per second [ipv6IfIcmpInEchoReplies].

oechr6/s

The number of ICMP Echo Reply messages sent by the interface per second [ipv6IfIcmpOutEchoReplies].

igmbq6/s

The number of ICMPv6 Group Membership Query messages received by the interface per second [ipv6IfIcmpInGroupMembQueries].

igmbr6/s

The number of ICMPv6 Group Membership Response messages received by the interface per second [ipv6IfIcmpInGroupMembResponses].

ogmbr6/s

The number of ICMPv6 Group Membership Response messages sent per second [ipv6IfIcmpOutGroupMembResponses].

igmbrd6/s

The number of ICMPv6 Group Membership Reduction messages received by the interface per second [ipv6IfIcmpInGroupMembReductions].

ogmbrd6/s

The number of ICMPv6 Group Membership Reduction messages sent per second [ipv6IfIcmpOutGroupMembReductions].

irtsol6/s

The number of ICMP Router Solicit messages received by the interface per second [ipv6IfIcmpInRouterSolicits].

ortsol6/s

The number of ICMP Router Solicitation messages sent by the interface per second [ipv6IfIcmpOutRouterSolicits].

irtad6/s

The number of ICMP Router Advertisement messages received by the interface per second [ipv6IfIcmpInRouterAdvertisements].

inbsol6/s

The number of ICMP Neighbor Solicit messages received by the interface per second [ipv6IfIcmpInNeighborSolicits].

onbsol6/s

The number of ICMP Neighbor Solicitation messages sent by the interface per second [ipv6IfIcmpOutNeighborSolicits].

inbad6/s

The number of ICMP Neighbor Advertisement messages received by the interface per second [ipv6IfIcmpInNeighborAdvertisements].

onbad6/s

The number of ICMP Neighbor Advertisement messages sent by the interface per second [ipv6IfIcmpOutNeighborAdvertisements].

With the EICMP6 keyword, statistics about ICMPv6 error messages are reported. Note that ICMPv6 statistics depend on sadc option “-S IPV6″ to be collected. The following values are displayed (formal SNMP names between square brackets):

ierr6/s

The number of ICMP messages per second which the interface received but determined as having ICMP-specific errors (bad ICMP checksums, bad length, etc.) [ipv6IfIcmpInErrors]

idtunr6/s

The number of ICMP Destination Unreachable messages received by the interface per second [ipv6IfIcmpInDestUnreachs].

odtunr6/s

The number of ICMP Destination Unreachable messages sent by the interface per second [ipv6IfIcmpOutDestUnreachs].

itmex6/s

The number of ICMP Time Exceeded messages received by the interface per second [ipv6IfIcmpInTimeExcds].

otmex6/s

The number of ICMP Time Exceeded messages sent by the interface per second [ipv6IfIcmpOutTimeExcds].

iprmpb6/s

The number of ICMP Parameter Problem messages received by the interface per second [ipv6IfIcmpInParmProblems].

oprmpb6/s

The number of ICMP Parameter Problem messages sent by the interface per second [ipv6IfIcmpOutParmProblems].

iredir6/s

The number of Redirect messages received by the interface per second [ipv6IfIcmpInRedirects].

oredir6/s

The number of Redirect messages sent by the interface by second [ipv6IfIcmpOutRedirects].

ipck2b6/s

The number of ICMP Packet Too Big messages received by the interface per second [ipv6IfIcmpInPktTooBigs].

opck2b6/s

The number of ICMP Packet Too Big messages sent by the interface per second [ipv6IfIcmpOutPktTooBigs].

With the UDP6 keyword, statistics about UDPv6 network traffic are reported. Note that UDPv6 statistics depend on sadc option “-S IPV6″ to be collected. The following values are displayed (formal SNMP names between square brackets):

idgm6/s

The total number of UDP datagrams delivered per second to UDP users [udpInDatagrams].

odgm6/s

The total number of UDP datagrams sent per second from this entity [udpOutDatagrams].

noport6/s

The total number of received UDP datagrams per second for which there was no application at the destination port [udpNoPorts].

idgmer6/s

The number of received UDP datagrams per second that could not be delivered for reasons other than the lack of an application at the destination port [udpInErrors].

The ALL keyword is equivalent to specifying all the keywords above and therefore all the network activities are reported.

-o [ filename ]

Save the readings in the file in binary form. Each reading is in a separate record. The default value of the filename parameter is the current daily data file, the /var/log/sa/sadd file. The -o option is exclusive of the -f option. All the data available from the kernel are saved in the file (in fact, sar calls its data collector sadc with the option “-S ALL”. See sadc(8) manual page).

-P { cpu [,…] | ALL }

Report per-processor statistics for the specified processor or processors. Specifying the ALL keyword reports statistics for each individual processor, and globally for all processors. Note that processor 0 is the first processor.

-p

Pretty-print device names. Use this option in conjunction with option -d. By default names are printed as dev m-n where m and n are the major and minor numbers for the device. Use of this option displays the names of the devices as they (should) appear in /dev. Name mappings are controlled by /etc/sysconfig/sysstat.ioconf.

-q

Report queue length and load averages. The following values are displayed:

Load and run-queue monitoring.

runq-sz

Run queue length (number of tasks waiting for run time).

Length of the queue of tasks waiting to run.

plist-sz

Number of tasks in the task list.

Total number of tasks.

ldavg-1

System load average for the last minute. The load average is calculated as the average number of runnable or running tasks (R state), and the number of tasks in uninterruptible sleep (D state) over the specified interval.

One-minute load average; it counts runnable and running tasks as well as tasks in uninterruptible sleep.

ldavg-5

System load average for the past 5 minutes.

ldavg-15

System load average for the past 15 minutes.

-r

Report memory utilization statistics. The following values are displayed:

Memory utilization monitoring.

kbmemfree

Amount of free memory available in kilobytes.

kbmemused

Amount of used memory in kilobytes. This does not take into account memory used by the kernel itself.

Memory in use, not counting memory used by the kernel itself.

%memused

Percentage of used memory.

kbbuffers

Amount of memory used as buffers by the kernel in kilobytes.

Memory used by the kernel as buffers, in KB. For the difference between buffers and cache, see “Understanding free command in Linux/Unix” and “Overview of memory management”. The key point: if an application later wants to use this buffer/cache memory, Linux will free it for the application.

kbcached

Amount of memory used to cache data by the kernel in kilobytes.

Memory used by the kernel to cache data, in KB. As above, if an application later needs this memory, Linux will free the cache for it.

kbcommit

Amount of memory in kilobytes needed for current workload. This is an estimate of how much RAM/swap is needed to guarantee that there never is out of memory.

Memory needed to keep the system running normally, in KB; this is an estimate.

%commit

Percentage of memory needed for current workload in relation to the total amount of memory (RAM+swap). This number may be greater than 100% because the kernel usually overcommits memory.

kbactive

Amount of active memory in kilobytes (memory that has been used more recently and usually not reclaimed unless absolutely necessary).

Active memory; it is not reclaimed unless absolutely necessary.

kbinact

Amount of inactive memory in kilobytes (memory which has been less recently used. It is more eligible to be reclaimed for other  purposes).

Inactive memory.

-R

Report memory statistics. The following values are displayed:

More memory-related statistics.

frmpg/s

Number of memory pages freed by the system per second. A negative value represents a number of pages allocated by the system. Note that a page has a size of 4 kB or 8 kB according to the machine architecture.

Memory pages freed by the system per second; a negative value means more pages were allocated than freed. A page is 4 KB or 8 KB depending on the architecture.

bufpg/s

Number of additional memory pages used as buffers by the system per second. A negative value means fewer pages used as buffers by the system.

Additional memory pages used as buffers per second.

campg/s

Number of additional memory pages cached by the system per second. A negative value means fewer pages in the cache.

Additional memory pages used for cache per second.

-s [ hh:mm:ss ]

Set the starting time of the data, causing the sar command to extract records time-tagged at, or following, the time specified. The default starting time is 08:00:00. Hours must be given in 24-hour format. This option can be used only when data are read from a file (option -f ).

-S

Report swap space utilization statistics. The following values are displayed:

kbswpfree

Amount of free swap space in kilobytes.

kbswpused

Amount of used swap space in kilobytes.

%swpused

Percentage of used swap space.

kbswpcad

Amount of cached swap memory in kilobytes. This is memory that once was swapped out, is swapped back in but still also is in the swap area (if memory is needed it doesn’t need to be swapped out again because it is already in the swap area. This saves I/O).

%swpcad

Percentage of cached swap memory in relation to the amount of used swap space.

-t

When reading data from a daily data file, indicate that sar should display the timestamps in the original locale time of the data file creator. Without this option, the sar command displays the timestamps in the user’s locale time.

-u [ ALL ]

Report CPU utilization. The ALL keyword indicates that all the CPU fields should be displayed. The report may show the following fields:

CPU utilization monitoring. This is also the default report, i.e. sar -u is the same as plain sar.

%user

Percentage of CPU utilization that occurred while executing at the user level (application). Note that this field includes time spent running virtual processors.

%usr

Percentage of CPU utilization that occurred while executing at the user level (application). Note that this field does NOT include time spent running virtual processors.

%nice

Percentage of CPU utilization that occurred while executing at the user level with nice priority.

%system

Percentage of CPU utilization that occurred while executing at the system level (kernel). Note that this field includes time spent servicing hardware and software interrupts.

%sys

Percentage of CPU utilization that occurred while executing at the system level (kernel). Note that this field does NOT include time spent servicing hardware or software interrupts.

%iowait

Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.

Share of time the CPU was idle while a disk I/O request was still outstanding.

%steal

Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.

%irq

Percentage of time spent by the CPU or CPUs to service hardware interrupts.

Share of time spent servicing hardware interrupts.

%soft

Percentage of time spent by the CPU or CPUs to service software interrupts.

Share of time spent servicing software interrupts.

%guest

Percentage of time spent by the CPU or CPUs to run a virtual processor.

%idle

Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.

The CPU is idle and there is no outstanding disk I/O request; note the contrast with %iowait.

Note: On SMP machines a processor that does not have any activity at all (0.00 for every field) is a disabled (offline) processor.

-v

Report status of inode, file and other kernel tables. The following values are displayed:

File-system and kernel-table statistics.

dentunusd

Number of unused cache entries in the directory cache.

file-nr

Number of file handles used by the system.

inode-nr

Number of inode handlers used by the system.

pty-nr

Number of pseudo-terminals used by the system.

-V

Print version number then exit.

-w

Report task creation and system switching activity.

Task creation and context-switching statistics.

proc/s

Total number of tasks created per second.

cswch/s

Total number of context switches per second.

-W

Report swapping statistics. The following values are displayed:

pswpin/s

Total number of swap pages the system brought in per second.

pswpout/s

Total number of swap pages the system brought out per second.

-y

Report TTY device activity. The following values are displayed:

rcvin/s

Number of receive interrupts per second for current serial line. Serial line number is given in the TTY column.

xmtin/s

Number of transmit interrupts per second for current serial line.

framerr/s

Number of frame errors per second for current serial line.

prtyerr/s

Number of parity errors per second for current serial line.

brk/s

Number of breaks per second for current serial line.

ovrun/s

Number of overrun errors per second for current serial line.

Note that with recent 2.6 kernels, these statistics can be retrieved only by root.

Environment

The sar command takes into account the following environment variables:

S_TIME_FORMAT

If this variable exists and its value is ISO then the current locale will be ignored when printing the date in the report header. The sar command will use the ISO 8601 format (YYYY-MM-DD) instead.

S_TIME_DEF_TIME

If this variable exists and its value is UTC then sar will save its data in UTC time (data will still be displayed in local time). sar will also use UTC time instead of local time to determine the current daily data file located in the /var/log/sa directory. This variable may be useful for servers with users located across several timezones.

Examples

sar -u 2 5

Report CPU utilization for each 2 seconds. 5 lines are displayed.

sar -I 14 -o int14.file 2 10

Report statistics on IRQ 14 for each 2 seconds. 10 lines are displayed. Data are stored in a file called int14.file.

sar -r -n DEV -f /var/log/sa/sa16

Display memory and network statistics saved in daily data file ‘sa16′.

sar -A

Display all the statistics saved in current daily data file.

Bugs

/proc filesystem must be mounted for the sar command to work.

All the statistics are not necessarily available, depending on the kernel version used.

Files

/var/log/sa/sadd

Indicate the daily data file, where the dd parameter is a number representing the day of the month.

/proc contains various files with system statistics.

Author

Sebastien Godard (sysstat <at> orange.fr)

See Also

sadc(8), sa1(8), sa2(8), sadf(1), isag(1), pidstat(1), mpstat(1), iostat(1), vmstat(8)

 

A storage server

Let's start with a storage server and look at its memory.

The free output is shown below. The first thing to note is total = 65953012 KB ≈ 62 GB, even though the machine has 64 GB of physical RAM: total does not include memory used by the kernel itself, and any memory reserved by hardware is not counted either.

The second line (-/+ buffers/cache) shows that if buffers and cached are treated as reusable, about 54 GB is still available, which is plenty. Excluding buffers and cached, truly free memory is only 268 MB, which is very little. cached in particular holds frequently used disk data; flushing it repeatedly hurts performance, because reads that used to be served from memory turn into disk reads. So how do we decide whether memory is really sufficient? Read on.

sar -r 1 5 shows overall memory usage below. The first few columns agree with free; the columns to watch are kbactive and kbinact. They show that if memory reclaim happens there is still 30+ GB of infrequently used inactive memory to draw from, so the frequently used active memory is normally left alone.

sar -B 1 shows paging activity; the excerpt below is a slice of it. pgpgin/s stays at 1024 for a stretch while pgpgout/s bursts intermittently. fault/s is fairly large but majflt/s is 0: page faults do occur, but they are mostly about mapping virtual memory to physical memory, with essentially no disk-to-memory loads, so performance is acceptable. This matches the workload on this box — HBase, HDFS, ZooKeeper, Thrift and ETL C++ processes, heavy on writes and light on reads — so outside of splits and merges there is almost no data being read in from disk.

sar -R 1 10: over this window more pages were allocated than freed, a small number of extra pages went to buffers, and quite a few more went to cache.

Summing vsz and rss over all processes gives 71 GB of vsz but only 9 GB of rss. vsz already exceeds the 62 GB of total memory, which is only possible thanks to the virtual memory system making memory appear larger than it is. rss is smaller than the 11 GB that free reports as used on the -/+ buffers/cache line, and clearly does not include buffers or cached either.

$ ps axu | perl -ne 'BEGIN{$vsz=0; $rss=0;} split; $vsz+=$_[4]; $rss+=$_[5]; END{print $vsz ."\t". $rss ."\n";}'

71522748 9625152

The top 10 processes by vsz are listed below (vsz and rss in KB). For most of them rss is far smaller than vsz: Linux defers actually allocating memory that a user process malloc's, and code and data segments are only loaded from disk into memory, via page faults, when they are first touched; until then they are just addresses in the virtual address space. (A small demand-paging experiment follows the list.)

  1. thrift 4,901,424 486,352
  2. hbase region server 4,789,688 4,485,900
  3. hbase master 4,672,580 232,412
  4. zookeeper 4,588,768 85,508
  5. ETL C++ 1,885,824 787,656
  6. ETL C++ 1,862,092 774,448
  7. ETL C++ 1,682,696 414,276
  8. hadoop datanode 1,416,264 176,760
  9. hadoop datanode 1,376,536 177,436
  10. monitoring-platform process 996,048 14,372
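To see this lazy allocation directly, here is a small experiment of mine (not part of the original analysis; the 512 MB size and the helper name print_vm are arbitrary). It reads the VmSize (vsz) and VmRSS (rss) lines of /proc/self/status before and after touching a large malloc'd buffer: VmSize jumps right after malloc, while VmRSS only catches up once memset has touched every page.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print the VmSize (vsz) and VmRSS (rss) lines of /proc/self/status. */
static void print_vm(const char *tag)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return;
    printf("--- %s ---\n", tag);
    while (fgets(line, sizeof(line), f))
        if (!strncmp(line, "VmSize:", 7) || !strncmp(line, "VmRSS:", 6))
            fputs(line, stdout);
    fclose(f);
}

int main(void)
{
    size_t sz = 512UL * 1024 * 1024;   /* 512 MB, an arbitrary test size */
    char *p;

    print_vm("before malloc");
    p = malloc(sz);                    /* address space is reserved here ... */
    if (!p)
        return 1;
    print_vm("after malloc (VmSize grows, VmRSS barely moves)");
    memset(p, 1, sz);                  /* ... but pages become resident only when touched */
    print_vm("after memset (VmRSS catches up)");

    free(p);
    return 0;
}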

A compute-oriented Nginx server

Now take an Nginx server and look at its memory. Its buffer usage is small, only about 290 MB; buffers are mainly read/write caching for block devices, and since nginx serves HTTP requests its disk throughput is small compared with HBase.

sar -r 1 5:

sar -B 1:

sar -R 1 10: on average more pages were allocated than freed over this window, no additional pages went to buffers, and a small number of additional pages went to cache.

Summing vsz and rss over all processes gives only 18 GB of vsz and 3 GB of rss; this server has plenty of memory to spare.

$ ps axu | perl -ne 'BEGIN{$vsz=0; $rss=0;} split; $vsz+=$_[4]; $rss+=$_[5]; END{print $vsz ."\t". $rss ."\n";}'
18630732 3119256

 

 

References

http://www.linuxatemyram.com/play.html

http://www.linuxhowtos.org/System/Linux%20Memory%20Management.htm

http://www.linuxnix.com/2013/05/find-ram-size-in-linuxunix.html

http://en.wikipedia.org/wiki/Paging

http://www.win.tue.nl/~aeb/linux/lk/lk-9.html

http://blog.csdn.net/dlutbrucezhang/article/details/9058583

http://oss.org.cn/kernel-book/

We often say Nginx is faster than Apache, and many discussions casually explain this as Nginx using epoll while Apache uses select. I had never dug deeper until a recent discussion made me actually ask: should we choose epoll in every scenario? And if so, why hasn't select been retired?

I found an article by someone named George; what follows is the article with my own annotations, based on my understanding of it.

Original article: http://www.ulduzsoft.com/2014/01/select-poll-epoll-practical-difference-for-system-architects/

select / poll / epoll: practical difference for system architects

When designing a high performance networking application with non-blocking socket I/O, the architect needs to decide which polling method to use to monitor the events generated by those sockets. There are several such methods, and the use cases for each of them are different. Choosing the correct method may be critical to satisfy the application needs.


This article highlights the difference among the polling methods and provides suggestions what to use.


 

Contents

  • 1 Polling with select()
  • 2 Polling with poll()
  • 3 Polling with epoll()
  • 4 Polling with libevent

Polling with select()

 

Old, trusted workforce from the times the sockets were still called Berkeley sockets. It didn’t make it into the first specification though, since there was no concept of non-blocking I/O at that moment, but it did make it in around the eighties, and nothing in its interface has changed since.


To use select, the developer needs to initialize and fill up several fd_set structures with the descriptors and the events to monitor, and then call select(). A typical workflow looks like that:


fd_set fd_in, fd_out;
struct timeval tv;

// Reset the sets
FD_ZERO( &fd_in );
FD_ZERO( &fd_out );

// Monitor sock1 for input events
FD_SET( sock1, &fd_in );

// Monitor sock2 for output events
FD_SET( sock2, &fd_out );

// Find out which socket has the largest numeric value as select requires it
int largest_sock = sock1 > sock2 ? sock1 : sock2;

// Wait up to 10 seconds
tv.tv_sec = 10;
tv.tv_usec = 0;

// Call the select
int ret = select( largest_sock + 1, &fd_in, &fd_out, NULL, &tv );

// Check if select actually succeeded
if ( ret == -1 )
{
    // report error and abort
}
else if ( ret == 0 )
{
    // timeout; no event detected
}
else
{
    if ( FD_ISSET( sock1, &fd_in ) )
    {
        // input event on sock1
    }
    if ( FD_ISSET( sock2, &fd_out ) )
    {
        // output event on sock2
    }
}

When the select interface was designed and developed, nobody probably expected there would be multi-threaded applications serving many thousands connections. Hence select carries quite a few design flaws which make it undesirable as a polling mechanism in the modern networking application. The major disadvantages include:


  • select modifies the passed fd_sets so none of them can be reused. Even if you don’t need to change anything – such as if one of the descriptors received data and needs to receive more data – a whole set has to be either recreated again (argh!) or restored from a backup copy via FD_COPY. And this has to be done each time the select is called. (A minimal sketch of this rebuild-and-rescan loop follows this list.)
  • To find out which descriptors raised the events you have to manually iterate through all the descriptors in the set and call FD_ISSET on each one of them. When you have 2,000 of those descriptors and only one of them is active – and, likely, the last one – you’re wasting CPU cycles each time you wait.
  • Did I just mention 2,000 descriptors? Well, select cannot support that much. At least on Linux. The maximum number of the supported descriptors is defined by the FD_SETSIZE constant, which Linux happily defines as 1024. And while some operating systems allow you to hack this restriction by redefining the FD_SETSIZEbefore including the sys/select.h, this is not portable. Indeed, Linux would just ignore this hack and the limit will stay the same.
  • You cannot modify the descriptor set from a different thread while waiting. Suppose a thread is executing the code above. Now suppose you have a housekeeping thread which decided that sock1 has been waiting too long for the input data, and it is time to cut the cord. Since this socket could be reused to serve another client, the housekeeping thread wants to close the socket. However the socket is in the fd_set which select is waiting for.
    Now what happens when this socket is closed? man select has the answer, and you won’t like it. The answer is, “If a file descriptor being monitored by select() is closed in another thread, the result is unspecified”.
  • Same problem arises if another thread suddenly decides to send something via sock1. It is not possible to start monitoring the socket for the output event until select returns.
  • (My note: can two threads select() on the same descriptor, for the same or for different events? And if that is allowed, what are the restrictions?)
  • The choice of the events to wait for is limited; for example, to detect whether the remote socket is closed you have to a) monitor it for input and b) actually attempt to read the data from socket to detect the closure (readwill return 0). Which is fine if you want to read from this socket, but what if you’re sending a file and do not care about any input right now?
  • select puts extra burden on you when filling up the descriptor list to calculate the largest descriptor number and provide it as a function parameter.
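To make the first two drawbacks concrete, here is a minimal sketch of mine (not from the original article) of the loop a select-based server typically ends up with; the fds array, nfds and serve_client() are hypothetical application bookkeeping:

#include <sys/select.h>

void serve_client(int fd);   /* hypothetical application handler */

void select_loop(const int *fds, int nfds)
{
    for (;;)
    {
        fd_set readfds;
        int i, maxfd = -1;

        /* Drawback 1: the set must be rebuilt before every call,
           because select() overwrites it with the ready descriptors. */
        FD_ZERO(&readfds);
        for (i = 0; i < nfds; i++)
        {
            FD_SET(fds[i], &readfds);
            if (fds[i] > maxfd)
                maxfd = fds[i];
        }

        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) <= 0)
            continue;   /* error or nothing ready; a real server would check errno */

        /* Drawback 2: every descriptor has to be tested with FD_ISSET
           to find the ones that are actually ready. */
        for (i = 0; i < nfds; i++)
            if (FD_ISSET(fds[i], &readfds))
                serve_client(fds[i]);
    }
}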

Of course the operating system developers recognized those drawbacks and addressed most of them when designing the poll method. Therefore you may ask, is there is any reason to use select at all? Why don’t just store it in the shelf of the Computer Science Museum? Then you may be pleased to know that yes, there are two reasons, which may be either very important to you or not important at all.


The first reason is portability. select has been around for ages, and you can be sure that every single platform around which has network support and nonblocking sockets will have a working select implementation while it might not have poll at all. And unfortunately I’m not talking about the tubes and ENIAC here; poll is only available on Windows Vista and above, which excludes Windows XP – still used by a whopping 34% of users as of Sep 2013 despite the Microsoft pressure. Another option would be to still use poll on those platforms and emulate it with select on those which do not have it; it is up to you whether you consider that a reasonable investment.


The second reason is more exotic, and is related to the fact that select can – theoretically – handle the timeouts within the one nanosecond precision, while both poll and epoll can only handle the one millisecond precision. This is not likely to be a concern on a desktop or server system, whose clocks don’t even run with such precision, but it may be necessary on a realtime embedded platform while interacting with some hardware components. Such as lowering control rods to shut down a nuclear reactor – in this case, please, use select to make sure we all stay safe!


The case above would probably be the only case where you would have to use select and could not use anything else. However if you are writing an application which would never have to handle more than a handful of sockets (like, 200), the difference between using poll and select would not be based on performance, but more on personal preference or other factors.


Polling with poll()

poll is a newer polling method which probably was created immediately after someone actually tried to write the high performance networking server. It is much better designed and doesn’t suffer from most of the problems which select has. In the vast majority of cases you would be choosing between poll and epoll/libevent.


To use poll, the developer needs to initialize the members of struct pollfd structure with the descriptors and events to monitor, and call the poll(). A typical workflow looks like that:

要用poll,开发者需要初始化pollfd结构体,包括fds和每个fd对应的监听事件,然后调用poll()方法。常见示例代码如下:

// The structure for two events
struct pollfd fds[2];
// Monitor sock1 for input
fds[0].fd = sock1;
fds[0].events = POLLIN;
// Monitor sock2 for output
fds[1].fd = sock2;
fds[1].events = POLLOUT;
// Wait 10 seconds
int ret = poll( fds, 2, 10000 );

// Check if poll actually succeeded
if ( ret == -1 )
    // report error and abort
else if ( ret == 0 )
    // timeout; no event detected
else
{
    // If we detect an event, zero it out so we can reuse the structure
    if ( fds[0].revents & POLLIN )
    {
        fds[0].revents = 0;
        // input event on sock1
    }

    if ( fds[1].revents & POLLOUT )
    {
        fds[1].revents = 0;
        // output event on sock2
    }
}

poll was mainly created to fix the pending problems select had, so it has the following advantages over it:

poll主要是为了解决select的问题,所以它具备以下优点:

  • There is no hard limit on the number of descriptors poll can monitor, so the limit of 1024 does not apply here.
  • 没有fd数量的硬性规定。
  • It does not modify the data passed in the struct pollfd structures. Therefore they can be reused between the poll() calls, as long as you set the revents member to zero for those descriptors which generated events (see the short sketch after this list). The IEEE specification states that “In each pollfd structure, poll() shall clear the revents member, except that where the application requested a report on a condition by setting one of the bits of events listed above, poll() shall set the corresponding bit in revents if the requested condition is true“. However in my experience at least one platform did not follow this recommendation, and man 2 poll on Linux does not make such a guarantee either (man 3p poll does though).
  • poll()方法不会修改传入的pollfd参数,所以pollfd参数可以在多次poll调用中被重复使用,只要每次处理完把当前pollfd的revents字段重新置为0即可。
  • It allows more fine-grained control of events comparing to select. For example, it can detect remote peer shutdown without monitoring for read events.
  • 相对select而言,更可控。例如,可以检测到对端关闭事件,而无需监听读事件。
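
A small sketch of that reuse pattern, building on the fds array from the example above; the array is filled once and only the revents members are cleared between poll() calls:

// The fds array filled above can be passed to poll() again and again;
// only the revents member of the entries which fired needs to be reset.
for (;;)
{
    int ret = poll( fds, 2, 10000 );

    if ( ret <= 0 )
        continue;   // error or timeout; real code would report the error

    for ( int i = 0; i < 2; i++ )
    {
        if ( fds[i].revents != 0 )
        {
            // ... handle the event for fds[i].fd here ...
            fds[i].revents = 0;  // clear it so the entry can be reused on the next call
        }
    }
}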

There are a few disadvantages as well, which were mentioned above at the end of the select section. Notably, poll is not present on Microsoft Windows older than Vista; on Vista and above it is called WSAPoll, although the prototype is the same, and it can be wrapped as simply as:

如前所述,poll也有一些弊端。比vista更老的windows系统上没有poll方法,vista和更新的系统里它被实现为WSAPoll,接口是一致的,可以被适配如下:

#if defined (WIN32)
static inline int poll( struct pollfd *pfd, int nfds, int timeout) { return WSAPoll ( pfd, nfds, timeout ); }
#endif

And, as mentioned above, poll timeout has the 1ms precision, which again is very unlikely to be a concern in most scenarios. Nevertheless poll still has a few issues which need to be kept in mind:

另外,如前所述,poll的timeout只能精确到1ms的精度,不过在大多数场景下都不是个事儿。但poll仍然有一些其他需要注意的问题:

  • Like select, it is still not possible to find out which descriptors have the events triggered without iterating through the whole list and checking the revents. Worse, the same happens in the kernel space as well, as the kernel has to iterate through the list of file descriptors to find out which sockets are monitored, and iterate through the whole list again to set up the events.
  • 和select()一样,开发者也需要遍历整个pollfds数组,才可以知道哪些fds被什么事件触发了。更糟的是,同样的事情也会发生在内核态,也就意味着操作系统也需要遍历所有fds来发现哪些sockets ready了,还需要遍历以set up监听事件。
  • Like select, it is not possible to dynamically modify the set or close the socket which is being polled (see above).
  • 和select()一样,也无法动态修改fds或关闭正在被监听的socket。

Those issues, however, are not important when implementing most client networking applications – the only exception would be P2P client software which may handle thousands of connections. They may not be important even for some server applications. Therefore poll should be your default choice over select unless you have the specific reasons mentioned above. Moreover, poll should be your preferred method even over epoll if the following is true:

不过通常来说,在实现大多数客户端网络应用时,这些弊端并不重要,唯一的例外是需要处理上千连接的p2p客户端软件。对某些服务端应用来说(poll的弊端)可能也不重要。所以除非有上面提到的特殊原因,相较于select而言,poll应该是默认的选择。甚至,如果满足以下条件,相较于epoll也应该优先使用poll:

  • You need to support more than just Linux, and do not want to use epoll wrappers such as libevent (epoll is Linux only);
  • 需要支持跨平台,而且不想使用类似libevent之类的适配方法(仅Linux支持epoll)
  • Your application needs to monitor less than 1000 sockets at a time (you are not likely to see any benefits from using epoll);
  • 仅需监听少于1000的网络连接(这时epoll没有什么优势)
  • Your application needs to monitor more than 1000 sockets at a time, but the connections are very short-lived (this is a close case, but most likely in this scenario you are not likely to see any benefits from using epoll because the speedup in event waiting would be wasted on adding those new descriptors into the set – see below)
  • 虽然需要监听多于1000的网络连接,但这些连接的生命周期都很短(虽然不是绝对,但该场景的大多数情况下,epoll没啥效果。因为事件等待节省的时间,都被浪费于将新建立的连接加入监听sets上了-下面有更详细的解释)
  • Your application is not designed the way that it changes the events while another thread is waiting for them (i.e. you’re not porting an app using kqueue or IO Completion Ports).
  • 应用不需要在一个线程阻塞等待事件的同时,由另一个线程修改被监听的事件(例如你不是在移植使用kqueue或IO Completion Ports的应用)

Polling with epoll()

epoll is the latest, greatest, newest polling method in Linux (and only Linux). Well, it was actually added to the kernel in 2002, so it is not so new. It differs from both poll and select in that it keeps the information about the currently monitored descriptors and the associated events inside the kernel, and exports the API to add/remove/modify those.

epoll是Linux下最新最好的polling方法,也仅有Linux支持epoll。不过它其实在2002年已经被加入内核了,所以也不年轻了。它与select和poll最大的区别是,它将当前被监听fds和事件相关的信息维持在内核空间里,并暴露增删改的API。

To use epoll, much more preparation is needed. A developer needs to:

要使用epoll,需要多做些准备工作。开发者需要:

  • Create the epoll descriptor by calling epoll_create;
  • 调用epoll_create生成epoll描述符
  • Initialize the struct epoll_event structure with the wanted events and the context data pointer. Context could be anything; epoll passes this value directly to the returned events structure. We store there a pointer to our Connection class.
  • 初始化epoll结构体,设置待监听事件和上下文数据指针。上下文数据可以是任意的,当事件触发时,epoll直接将它作为返回值的一部分。在下面的例子里,我们使用Connection对象的指针作为上下文。
  • Call epoll_ctl( … EPOLL_CTL_ADD ) to add the descriptor into the monitoring set
  • 调用epoll_ctl(…EPOLL_CTL_ADD )将描述符添加到监控set里(传入的参数包括 前面create的 epoll fd,待监听的端口fd,epoll结构体)
  • Call epoll_wait() to wait for up to 20 events for which we reserve the storage space. Unlike the previous methods, this call receives an empty structure, and fills it up only with the triggered events. For example, if there are 200 descriptors and 5 of them have events pending, epoll_wait will return 5, and only the first five members of the pevents array will be initialized. If 50 descriptors have events pending, the first 20 would be copied and 30 would be left in the queue; they won’t get lost.
  • 调用epoll_wait方法等待最多20个事件的返回。该方法接受一个空结构体pevents作为参数,并仅在有事件被触发时填充其值。例如,如果一共有200个被监听fds,只有5个被事件触发,那么epoll_wait返回5,且只有pevents的前5个槽位被填充了。而如果有50个事件,那么前20个事件被copy到pevents里,后30个被维持在epoll的queue里,不会丢失。
  • Iterate through the returned items. This will be a short iteration since the only events returned are those which are triggered.
  • 遍历返回的数据。这个遍历相对较短,因为仅包含了被触发的事件(不像之前select和poll,需要遍历所有的被监听fds)。

A typical workflow looks like this:

常见代码示例如下:

// Create the epoll descriptor. Only one is needed per app, and is used to monitor all sockets.
// The function argument is ignored (it was not before, but now it is), so put your favorite number here
int pollingfd = epoll_create( 0xCAFE ); 

if ( pollingfd < 0 )
 // report error

// Initialize the epoll structure in case more members are added in future
struct epoll_event ev = { 0 };

// Associate the connection class instance with the event. You can associate anything
// you want, epoll does not use this information. We store a connection class pointer, pConnection1
ev.data.ptr = pConnection1;

// Monitor for input, and do not automatically rearm the descriptor after the event
ev.events = EPOLLIN | EPOLLONESHOT;

// Add the descriptor into the monitoring list. We can do it even if another thread is
// waiting in epoll_wait - the descriptor will be properly added
if ( epoll_ctl( pollingfd, EPOLL_CTL_ADD, pConnection1->getSocket(), &ev ) != 0 )
    // report error

// Wait for up to 20 events (we may have added many more sockets than that before)
struct epoll_event pevents[ 20 ];

// Wait for 10 seconds
int ready = epoll_wait( pollingfd, pevents, 20, 10000 );

// Check if epoll actually succeeded
if ( ready == -1 )
    // report error and abort
else if ( ready == 0 )
    // timeout; no event detected
else
{
    // Process only the events which were actually triggered
    for ( int i = 0; i < ready; i++ )
    {
        if ( pevents[i].events & EPOLLIN )
        {
            // Get back our connection pointer
            Connection * c = (Connection*) pevents[i].data.ptr;
            c->handleReadEvent();
         }
    }
}

附 EPOLLONESHOT 的解释,摘自 man epoll_ctl:

EPOLLONESHOT
    Sets the one-shot behaviour for the associated file descriptor. It means
    that after an event is pulled out with epoll_wait(2) the associated file
    descriptor is internally disabled and no other events will be reported by
    the epoll interface. The user must call epoll_ctl(2) with EPOLL_CTL_MOD to
    re-enable the file descriptor with a new event mask.
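
To make the one-shot behaviour concrete, here is a minimal sketch (reusing pollingfd, ev and pConnection1 from the example above) of re-arming the descriptor with EPOLL_CTL_MOD once its event has been handled:

// The descriptor was added with EPOLLONESHOT, so after its event has been
// delivered by epoll_wait() it is internally disabled. Re-arm it with a
// (possibly different) event mask before expecting any further events.
ev.data.ptr = pConnection1;               // keep the same context pointer
ev.events   = EPOLLIN | EPOLLONESHOT;     // or EPOLLOUT, if we now want to write

if ( epoll_ctl( pollingfd, EPOLL_CTL_MOD, pConnection1->getSocket(), &ev ) != 0 )
    // report error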

Just looking at the implementation alone should give you a hint of the disadvantages of epoll, which we will mention first. It is more complex to use, requires you to write more code, and requires more library calls comparing to the other polling methods.

只是看看上面的示例,就会直观的感觉到epoll的第一个弊端,相对其他polling方法而言,它太复杂了,需要写更多的代码,更多的函数调用。

However epoll has some significant advantages over select/poll both in terms of performance and functionality:

不过epoll确实在性能和功能上有一些优势:

  • epoll returns only the list of descriptors which triggered the events. No need to iterate through 10,000 descriptors to find that one which triggered the event!
  • epoll仅返回被触发事件的列表。无需遍历所有的监听描述符,只为了定位一个被触发的事件。
  • You can attach meaningful context to the monitored event instead of socket file descriptors. In our example we attached the class pointers which could be called directly, saving you another lookup.
  • 可以在被触发的事件粒度,附加上下文数据(即针对一个fd,使用不同的事件定义不同的epoll结构体,并add到监听set)。在上面的例子里,使用可以被直接调用的对象指针,进一步节省了找到合适回调方法的消耗。
  • You can add sockets or remove them from monitoring anytime, even if another thread is in the epoll_wait function. You can even modify the descriptor events. Everything will work properly, and this behavior is supported and documented. This gives you much more flexibility in implementation.
  • 可以随时增删被监听的端口,即使有其他线程被阻塞在epoll_wait()。甚至可以随时修改监听端口对应的事件。这些都是被支持的,且文档化的。这样在实现层面就增加了灵活性。
  • Since the kernel knows all the monitoring descriptors, it can register the events happening on them even when nobody is calling epoll_wait. This allows implementing interesting features such as edge triggering, which will be described in a separate article.
  • 由于内核有所有被监听端口的信息,所以即使没有任何用户调用epoll_wait,内核也可以注册已触发的事件。这些可以用于实现有趣的功能,例如edge triggering(边沿触发),会在单独的文章里介绍。
  • It is possible to have multiple threads waiting on the same epoll queue with epoll_wait(), something you cannot do with select/poll. In fact it is not only possible with epoll, but the recommended method in the edge triggering mode.
  • 允许多个线程使用epoll_wait等待相同的epoll queue(拥有相同的epollfd),这些是select和poll无法做到的。事实上,当使用edge triggering这是建议的方式。

However you need to keep in mind that epoll is not a “better poll”, and it also has disadvantages when comparing to poll:

但是需要谨记的是,epoll不是万能的,相对poll而言,它也有其不足:

  • Changing the event flags (i.e. from READ to WRITE) requires the epoll_ctl syscall, while when using poll this is a simple bitmask operation done entirely in userspace (see the short sketch after this list). Switching 5,000 sockets from reading to writing with epoll would require 5,000 syscalls and hence context switches (as of 2014 calls to epoll_ctl still could not be batched, and each descriptor must be changed separately), while in poll it would require a single loop over the pollfd structures.
  • 改变监听事件(例如从读改为写)需要调用epoll_ctl系统方法,而poll时这仅是一个用户空间的bit位设置。如果需要将5000个端口从监听读改为监听写,就需要5000次独立的系统调用(截止2014年,epoll_ctl还不支持批量,每个fd需要被独立设置),而poll里仅是针对pollfd结构体的一次遍历。
  • Each accept()ed socket needs to be added to the set, and same as above, with epoll it has to be done by calling epoll_ctl – which means there are two required syscalls per new connection socket instead of one for poll. If your server has many connections which do not send much information, epoll will likely take longer than poll to serve them.
  • 每一个accept()到的新连接都需要通过调用epoll_ctl添加到监听set里 — 即每个新连接需要两次系统调用(accept()之外还要一次epoll_ctl()),而poll只需要accept()这一次,之后只是在用户空间把fd加入pollfd数组。所以,如果你的服务器维持了很多连接,且每个连接都只交互少量数据,那epoll的耗时反而会大于poll。
  • epoll is exclusively Linux domain, and while other platforms have similar mechanisms, they are not exactly the same – edge triggering, for example, is pretty unique (FreeBSD’s kqueue supports it too though).
  • 仅Linux支持epoll,虽然其他操作系统也有类似的机制,但都略有不同。例如epoll的edge triggering就相当独特(虽然FreeBSD的kqueue也支持)。
  • High performance processing logic is more complex and hence more difficult to debug, especially for edge triggering which is prone to deadlocks if you miss extra read/write.
  • 高性能的逻辑实现更复杂,也就更难调试,尤其是在edge triggering时遗漏额外读写而导致的死锁。
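
As a rough illustration of that first point (reusing the fds array and the pollingfd/ev/pConnection1 names from the examples above), the same read-to-write switch looks like this with each method:

// poll: switching interest from reading to writing is a plain bitmask
// assignment in user space - no syscall involved
fds[0].events = POLLOUT;

// epoll: the same switch costs one epoll_ctl() syscall per descriptor
ev.events = EPOLLOUT | EPOLLONESHOT;
if ( epoll_ctl( pollingfd, EPOLL_CTL_MOD, pConnection1->getSocket(), &ev ) != 0 )
    // report error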

Therefore you should only use epoll if all following is true:

所以,仅当以下条件都成立时,才应该选用epoll:

  • Your application runs a thread pool which handles many network connections with a handful of threads. You would lose many of epoll's benefits in a single-threaded application, and most likely it won’t outperform poll.
  • 应用使用线程池,由少量线程处理大量网络连接。在单线程应用里会损失epoll的大部分好处,而且性能也很可能比不上poll。
  • You expect to have a reasonably large number of sockets to monitor (at least 1,000); with a smaller number epoll is not likely to have any performance benefits over poll and may actually have worse performance;
  • 应用需要管理相当多的连接(至少1000)。少量连接情况下,epoll的性能可能还会弱于poll。
  • Your connections are relatively long-lived; as stated above epoll will be slower than poll in a situation when a new connection sends a few bytes of data and immediately disconnects because of extra system call required to add the descriptor into epoll set;
  • 网络连接的生命周期较长。如前所述,如果一个新的连接只是发送很少的数据,然后就关闭了,那epoll会由于额外系统调用(而导致的内核态切换)变得较慢。
  • Your app depends on other Linux-specific features (so in case portability question would suddenly pop up, epoll wouldn’t be the only roadblock), or you can provide wrappers for other supported systems. In the last case you should strongly consider libevent.
  • 应用还依赖其他Linux相关特性(所以就不需要考虑移植性了),或者开发者愿意针对其他平台提供适配方法。后者建议使用libevent。

If all the items above aren’t true, you should be better served by using poll instead.

如果以上都不成立的话,那还是用poll吧。

Polling with libevent

libevent is a library which basically wraps the polling methods listed in this article (and some others) in a uniform API. Its main advantage is that it allows you to write the code once and compile and run it on many operating systems without the need to change the code. It is important to understand that libevent is just a wrapper built on top of the existing polling methods, and therefore it inherits the issues those polling methods have. It will not make select support more than 1024 sockets on Linux or allow epoll to modify the polling events without a syscall/context switch. Therefore it is still important to understand each method’s pros and cons.

libevent针对以上提到的polling方法和其他一些特性,提供了统一的API。它最大的好处是,允许写一份代码,然后编译执行于多个操作系统。但需要谨记的是,libevent也只是基于现有polling方法的封装,并没有改变任何问题。例如它不会使select支持多于1024个监听端口,也不会使epoll免于修改监听事件导致的内核态上下文切换。所以,还是需要理解每个polling方法的优劣。
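
For a feel of the API, here is a minimal sketch (assuming libevent 2.x and a connected socket sock1; the on_read callback name is made up for illustration) which monitors one socket for input; libevent picks epoll, kqueue, poll or select internally:

#include <event2/event.h>

// Callback invoked by libevent whenever sock1 becomes readable
static void on_read( evutil_socket_t fd, short what, void *arg )
{
    // ... read from fd here ...
}

struct event_base *base = event_base_new();   // chooses the best backend available
struct event *ev = event_new( base, sock1, EV_READ | EV_PERSIST, on_read, NULL );

event_add( ev, NULL );                        // NULL timeout: wait indefinitely
event_base_dispatch( base );                  // run the event loop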

Having to provide access to the functionality of dramatically different methods, libevent has a rather complex API which is much more difficult to use than poll or even epoll. It is however easier to use libevent than to write two separate backends if you need to support FreeBSD as well (epoll and kqueue). Hence it is a viable alternative which should be considered if:

为了适配底层各种千差万别的polling方法,libevent的API更为复杂,但这也好过为了不同操作系统开发不同的代码。所以,如果以下条件满足,libevent不失为一个不错的选择:

  • Your application requirements indicate that you must use epoll, and using just poll would not be enough (if poll would satisfy your needs, it is extremely unlikely libevent would offer you any benefits)
  • 应用强烈需要用epoll(如果poll就够用了,那无需使用libevent)。
  • You need to support other OS than Linux, or may expect such need to arise in future. Again, this depends on other features of your application – if it is tied up to many other Linux-specific things you’re not going to achieve anything by using libevent instead of epoll
  • 现在或未来需要支持Linux和其他更多操作系统。当然这也依赖于应用所需的其他功能,如果很多linux特殊功能,那libevent也无法全部解决。

 

附man下各命令对应的领域:

1  用户命令,  可由任何人启动的。
2  系统调用,  即由内核提供的函数。
3  例程,   即库函数。
4  设备,   即/dev目录下的特殊文件。
5  文件格式描述,  例如/etc/passwd。
6  游戏,   不用解释啦!
7  杂项,   例如宏命令包、惯例等。
8  系统管理员工具, 只能由root启动。
9  其他(Linux特定的), 用来存放内核例行程序的文档。
n  新文档,  可能要移到更适合的领域。
o  老文档,  可能会在一段期限内保留。
l  本地文档,  与本特定系统有关的。

今年做的重要项目之一,就是对一个核心Web系统重构,使之达到了99.999% 的高可用性。在此过程中,积累了一些系统架构、自动故障发现与解决、代码健壮性等方面的经验,予以记录。

业务背景介绍

该web系统是一个大型互联网系统商业运营侧的核心web系统,PHP语言实现,之前的可用性方面存在一定问题,在历史上也出过不少事故。由于是商业系统,其PV仅是中等规模,但复杂度相对较高,体现在其所涉及的网络交互、DB调用、Cache交互较多。

需要解决的问题

  1. 所依赖的非核心上游服务不可用时,及时发现并自动降级
  2. 所依赖的上游服务部分节点不可用时,及时发现并自动摘除故障节点
  3. 通过Lib库,封装网络交互中的重试、连接和读写超时、异常日志和以上各种功能,使之对业务层透明

整体架构设计

该系统采用集中式+分布式相结合的异常发现和处理方案。由浅蓝色业务层和深蓝色基础服务设施一起,识别并处理异常攻击、网络、系统和上游服务问题。为了最大程度的解耦,它们交互的方式,就是规范化的文本日志和INI配置。

之所以采用集中+分布的混合方式,是由于所处系统环境的复杂度导致的:

  1. 对于异常流量,由于目前集群规模较大,单点不可能阈值配置过大(规则),也无法收集到全局信息(策略),只有集中式才可以综合判断;
  2. 对于上游服务的整体故障,同样的道理,单点也不可能阈值配置过小,否则很容易产生抖动,而集中式可以全局收集信息,全局联动;
  3. 但集中式的服务降级,无法很好解决以下两种问题,所以需要通过每个业务节点通过节点健康检查来处理。
    1. 不可降级的核心服务交互
    2. 点对点的非核心服务故障,例如上游服务部分机房故障,或某台服务器故障

全局故障监控与降级

故障采集:为了降低侵入性,采取了由业务模块打印warn、error log,各业务节点上的守护进程阻塞read日志文件,并发往多个(目前配置为2个)同构监控与调度服务的方式。之所以采用多写的方式,是由于一期实现时,复用了日志ETL流,走的是公司的消息队列和流式计算平台,而它们都是强一致性的,当网络出现抖动或拥塞时,故障日志甚至会延迟数小时之久!所以,最终我们决定采用最简单的方式,对于这种对实时性要求极高的消息交互,采用CS直接透传的方式。而交叉多写可以保证只要不是所有机房间链路都发生故障,就可以在秒级别完成消息传递。多个监控与调度服务完全同构,都会接收异常日志并予以计算和处理,由于重复降级不会产生副作用,所以没有做master、slave角色的划分。

故障判别:出于扩展性和复用角度的考虑,故障判别的规则是基于Protobuf格式配置的,基本原理就是配置:某一种类型的异常日志在一段时间内到达指定阈值后,需要将什么配置项改为什么值。由于异常日志的数量一般不会太大,所以每个监控与调度服务都是独立运行的,不存在数据同步的问题,时间窗口、频率等信息都在内存中直接计算。当需要横向扩展时,可以与故障采集配合,根据业务模块、异常日志类型进行纵向划分,映射到不同的监控节点上。仅有一点需要注意的是,为了防止网络等问题导致消息延迟到达,在计算频率时,会过滤掉超时的消息。

故障处理:同样为了降低业务侵入性和耦合度,故障处理是通过修改业务模块的配置文件实现的。通过在PHP业务节点上部署我们的zk_agent,监听zk文件变化并修改业务模块的ini配置对应项,和复用C++进程的基于zk的热加载功能,监控与调度模块只需与zookeeper集群通信,修改指定配置项,而无需知晓业务模块的具体信息。上面提到的多实例方式,也可以保证只要有一个监控与调度模块与zk集群通信成功,就可以成功完成降级指令的发布。这里需要注意的是,由于机器规则总有其不完善处,引入了保险栓的方式,保证在开启保险栓的情况下,人工干预的优先级最高。

当然,这套流程也有其不完善处,例如为了避免网络抖动,我们是通过设置一个较大的阈值来实现的,而没有做更为精细化的处理。同时,其配置恢复是手工的,因为监控与调度模块为了降低耦合度和复杂度,没有主动去探测故障的恢复情况。

单点故障监控与摘除

作为自动降级的补充,该功能着眼于发现点对点的问题,并予以处理。其原理也是:该业务服务器上,在一段时间内,对某上游节点的调用若失败超过阈值,则屏蔽该上游节点。这里直接通过真实请求的失败进行计数,没有开启独立进程,也是为了降低复杂度,提升易用性。

采用策略模式,实现了APC和File两种存储介质的封装,并支持扩展。由于APC也是跨进程的,所以可以在单机所有PHP进程间共享失败次数、故障节点信息。经过测试,APC的性能优于File,所以一般推荐使用APC模式。但为了兼容那些未安装APC module的业务模块,所以也保留了File的方式。

由于大中型系统的点对点故障是频繁发生的,所以这里采用屏蔽超时自动恢复的方式,虽然可能会造成锯齿状的耗时波动,但无需人工干预,是综合而言最优的方式。

由于请求间可能有相互依赖关系,单点是无法handle降级的,所以一旦检查到某服务的上游节点全部不可用,则重新置位,设为全部可用。

PHP基础lib库

以上及其他未提及的稳定性保障,对于业务层而言略显复杂,且由每个业务开发者来保证这些也是不可靠的,所以必须有一套对业务透明的php Lib库,予以封装。我们实现了这样一套Lib,首先细致处理了Mysql、Memcache、Curl、Socket、Redis等网络交互时的重试、连接和读写超时。其次,以上网络组件又是基于代码层的负载均衡、节点健康检查、服务降级、配置组件,从而对上屏蔽稳定性细节。同时,该Lib充分考虑易用性,例如通过对多种Cache的封装,仅暴露一个简单的Call接口和规范化配置,就可以使业务层完成旁路cache的使用。

总结

通过以上方式,该系统在上线3个月内,自动处理了2000多次故障,实现了按请求成功率计算,99.999%以上的可用性。感触最深的是,很多做业务开发的PHPer,简单的认为稳定大多是运维同学的责任,但想要达到高稳定性,工作其实从架构设计就开始了,更深深渗透在每一行代码里!

本文通过总结当前项目的架构过程,希望抽象出通用思路、并指导我后续发展的方向。

  • 业务理解
  • 逻辑架构设计(模块划分依据)
  • 业务检查
  • 关键技术选型
  • 模块交互接口设计
  • 数据结构设计(粗粒度)
  • 模块架构设计(可能由具体工程师完成)

业务理解
  • input:需求文档(PM提供)
  • output:技术需求文档(架构师产出)

理解PM提出的需求,并对相似产品进行调研,在脑海里构建一个可以run起来的demo。如果发现不完整或不完善的地方,记录并及时与PM交流。然后尝试将其归类(业务复杂型、高并发、大数据处理型等),并与某些(多个)见到过的架构类比,找到参照物(可能没有办法找到100%一样的,但总能找到相似的)。

由于PM提出的需求,可能更多是功能性的,还需自行确认非功能性需求。这时可以从PV、UV、项目重要程度(影响稳定性等指标)等方面,并预估其项目发展速度(大中小型),从而给出吞吐量、性能、数据量、可用性、数据一致性、扩展性、区域部署或国际化等方面的需求,形成中间文档《非功能性需求文档》。

这时,就可以对照着需求文档,将其细化为《技术需求文档》。其中,可以把相似的功能聚类到一起,形成层级和依赖关系,这些就是潜在的模块划分和交互。

逻辑架构设计
  • input:技术需求文档
  • output:多种架构图、模块功能点说明

这阶段更多的是做逻辑架构(分模块架构),并结合数据交互图、关键case时序图,进行整体架构的描绘。

在设计逻辑架构时,就需要考虑“为什么这样划分模块?”出发点可能有:

  • 功能相关性和解耦,所以这些功能需要放在一个模块里,而其他功能不能放在这里
  • 复用性,可能多个模块都需要这些功能,所以其需要被抽取出来,作为横向通用模块
  • 依赖关系,从而形成分层架构
  • 发展前景,对于可能膨胀和需要优化的模块,需要小心制定其功能边界,确保其变动不会影响到其他模块
  • 其他的模块类型还有:底层存储层,垂直调度模块(各个层级都会使用的功能,例如配置中心),第三方插件,外部依赖服务,流量入口类型(严格说不算模块)
  • 监控和报表模块,必须牢记上线之后才是危险真正来临的时刻

在画完逻辑架构之后,数据交互图和关键case时序图,一方面是便于向他人传递架构师的思路,另一方面(很重要)是自己检查是否能满足功能性需求。

 业务检查

完成了整体架构后,还需要再比对着需求文档,进行检查。我的做法就是在脑子里过流程,确保从用户、客户、系统、数据等多个维度,可以跑通。这时,可能还会发现功能上的缺失,可以再回过头对需求、架构进行调整。

关键技术选型
  • input:多种架构图
  • output:技术选型与折衷文档

模块架构的完成,说实话离实现还远着呢。怎样实现这些架构,是这个阶段需要解决的问题:

  • 各个模块的语言选择:根据人力储备、功能类型(CPU密集型、IO阻塞型等)、工期要求、性能要求等进行选择
  • 网络交互协议、序列化协议的选择:一般建议采用公司通用协议,以降低学习和开发成本。例如服务端交互Protobuf,前后端交互Jsonp等
  • 存储层选型:业务计算复杂、数据量小的mysql,吞吐量大、性能要求高的Redis/Memcached等,海量数据的HDFS/Hbase/Hive
  • 依赖第三方技术选型:是否需要分布式计算(非实时的Hadoop之类,实时的Storm之类),是否需要CDN,是否需要图片处理,是否需要分布式文件系统(图片、视频、音频之类)等
  • 部署考虑:各模块是混布还是隔离?部署在哪些区域,区域间如何保持数据同步?如果进行国际化,是否存在网络时延、带宽、时差、本地化等问题(其中网络相关问题比较多)

以上技术选型时,可能会受到排期、人力制约,作出各种折衷,也需在此一一说明。当然,在作出折衷的时候,需要考虑清楚,后续升级或迁移的方案和成本。

另外,针对第三方依赖,架构师得能够给出合理的选择理由,对风险有足够的认识,必要的时候,得跟其维护者进行沟通,确认风险和最佳实践。

模块交互接口设计
  • input:多种架构图、模块功能说明文档、技术选型文档
  • output:模块交互接口文档(粗略稿)

结合上面的网络协议、序列化协议 和 模块功能,这里需要清晰给出各模块的功能边界(与各模块负责人需要达成共识),以及相互依赖关系(谁在什么情况下,调用谁),并在此基础上,制定各模块的交互接口。例如,是网络调用还是方法调用,如果是网络调用使用什么交互协议,是同步调用还是异步调用,对性能和稳定性有没有特殊要求。这时,可以不涉及接口的具体参数、返回值,而由模块详设时再行确认。

数据结构设计
  • input:多种架构图、模块功能说明文档、技术选型文档
  • output:数据结构设计文档(粗略稿)

在该阶段,可能无法给出具体的scheme设计,但需要确认哪些数据使用哪种存储介质,是否需要分库分表,是否能够满足功能和性能需求,是否有什么在详设时需要特别注意的技术点。若存在无法解决的需求问题,可能还需要修改技术选型。

完成以上工作之后,可以形成《整体架构设计文档》,并需要对其进行评审,一般需要其他团队高工,PM、RD、FE、QA、OP、DBA、UE等角色的参与。其他团队的高工主要负责看技术设计是否合适,有没有明显漏洞或优化点;PM和QA需要确保架构符合功能需求,QA还需要思考设计是否具备可测试性。OP、DBA需要从部署架构、存储选型方面予以审核。

模块架构设计
  • input:整体架构设计文档
  • output:各模块详设文档

此部分可以由模块负责人进行,但架构师需要全程参与,确保满足整体需求、没有功能遗漏或重复,模块中采用较优、具备可扩展性的架构设计。如果模块间出现技术配合问题,架构师得予以调解。

 

另外,为了按时按质完成以上工作,架构师需要在一开始就制定合理的日程表,细化到每一天需要完成什么事情(产生什么、与谁沟通)。在项目设计完和发布时等阶段性时间点,最好还能够进行回顾,总结出好的和不好的地方,以便下次改进。

网络通讯大部分是基于TCP/IP的,而TCP/IP是基于IP地址的,所以计算机在网络上进行通讯时只能识别如“202.96.134.133”之类的IP地址,而不能认识域名。我们无法记住10个以上IP地址的网站,所以我们访问网站时,更多的是在浏览器地址栏中输入域名,就能看到所需要的页面,这是因为有一个叫“DNS服务器”的计算机自动把我们的域名“翻译”成了相应的IP地址,然后调出IP地址所对应的网页。

什么是DNS?
DNS( Domain Name System)是“域名系统”的英文缩写,是一种组织成域层次结构的计算机和网络服务命名系统,它用于TCP/IP网络,它所提供的服务是用来将主机名和域名转换为IP地址的工作。DNS就是这样的一位“翻译官”,它的基本工作原理可用下图来表示。

DNS域名称
域名系统是一个层次化的分布式数据库,包含各种类型的数据,例如主机名和域名。DNS数据库中的名称构成一个分层的树状结构,称为域命名空间。域名由以点分隔的多个标签组成,例如:im.qq.com。
完全限定域名(FQDN)唯一地标识主机在DNS分层树中的位置,即从该主机到根、以点分隔的名称路径。下图显示名为im的主机位于qq.com域内的DNS树示例,该主机的FQDN是im.qq.com。
DNS 域的名称层次结构

DNS域名称空间的组织方式
命名空间中用来描述DNS域名称的五个功能类别,以及每种名称类型的示例,详见下表。

DNS 和 Internet 域
Internet域名系统由名称注册机构负责维护,顶级域按组织和国家/地区划分,在Internet上统一管理;国家/地区域名遵循国际标准ISO 3166。一些常见的缩写(既有保留给组织使用的缩写,也有两字母和三字母的国家/地区缩写)如下表所示。一些常见的DNS域名称如下图:

资源记录
DNS数据库中包含资源记录(RR),每条RR标识数据库中的一个特定资源。我们在建立DNS服务器时,经常会用到SOA、NS、A之类的记录;在维护DNS服务器时,会用到MX、CNAME记录。
常见的RR见下图:

Dns服务的工作过程
当 DNS 客户机需要查询程序中使用的名称时,它会查询本地DNS 服务器来解析该名称。客户机发送的每条查询消息都包括3条信息,以指定服务器应回答的问题。
● 指定的 DNS 域名,表示为完全合格的域名 (FQDN) 。
● 指定的查询类型,它可根据类型指定资源记录,或作为查询操作的专门类型。
● DNS域名的指定类别。
对于DNS 服务器,它始终应指定为 Internet 类别。例如,指定的名称可以是计算机的完全合格的域名,如im.qq.com,并且指定的查询类型用于通过该名称搜索地址资源记录。
DNS 查询以各种不同的方式进行解析。客户机有时也可通过使用从以前查询获得的缓存信息就地应答查询。DNS 服务器可使用其自身的资源记录信息缓存来应答查询,也可代表请求客户机来查询或联系其他 DNS 服务器,以完全解析该名称,并随后将应答返回至客户机。这个过程称为递归。
另外,客户机自己也可尝试联系其他的 DNS 服务器来解析名称。如果客户机这么做,它会使用基于服务器应答的独立和附加的查询,该过程称作迭代,即DNS服务器之间的交互查询就是迭代查询。
DNS 查询的过程如下图所示。

1、在浏览器中输入www.qq.com域名,操作系统会先检查自己本地的hosts文件是否有这个网址映射关系,如果有,就先调用这个IP地址映射,完成域名解析。

2、如果hosts里没有这个域名的映射,则查找本地DNS解析器缓存,是否有这个网址映射关系,如果有,直接返回,完成域名解析。

3、如果hosts与本地DNS解析器缓存都没有相应的网址映射关系,首先会找TCP/ip参数中设置的首选DNS服务器,在此我们叫它本地DNS服务器,此服务器收到查询时,如果要查询的域名,包含在本地配置区域资源中,则返回解析结果给客户机,完成域名解析,此解析具有权威性。

4、如果要查询的域名,不由本地DNS服务器区域解析,但该服务器已缓存了此网址映射关系,则调用这个IP地址映射,完成域名解析,此解析不具有权威性。

5、如果本地DNS服务器本地区域文件与缓存解析都失效,则根据本地DNS服务器的设置(是否设置转发器)进行查询,如果未用转发模式,本地DNS就把请求发至13台根DNS,根DNS服务器收到请求后会判断这个域名(.com)是谁来授权管理,并会返回一个负责该顶级域名服务器的一个IP。本地DNS服务器收到IP信息后,将会联系负责.com域的这台服务器。这台负责.com域的服务器收到请求后,如果自己无法解析,它就会找一个管理.com域的下一级DNS服务器地址(qq.com)给本地DNS服务器。当本地DNS服务器收到这个地址后,就会找qq.com域服务器,重复上面的动作,进行查询,直至找到www.qq.com主机。

6、如果用的是转发模式,此DNS服务器就会把请求转发至上一级DNS服务器,由上一级服务器进行解析,上一级服务器如果不能解析,或找根DNS,或把请求转至上上级,以此循环。不管本地DNS服务器用的是转发,还是根提示,最后都是把结果返回给本地DNS服务器,由此DNS服务器再返回给客户机。

从客户端到本地DNS服务器属于递归查询,而DNS服务器之间的交互查询就是迭代查询。
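
下面给出一个最简示例(仅为示意,假设在Linux/C环境下使用标准解析接口getaddrinfo,域名www.qq.com仅作演示):对客户端程序而言,只是调用一次解析库,上述的递归查询全部由本地DNS服务器代为完成。

#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main( void )
{
    struct addrinfo hints, *res, *p;

    memset( &hints, 0, sizeof( hints ) );
    hints.ai_family   = AF_INET;        // IPv4 only, to keep the output simple
    hints.ai_socktype = SOCK_STREAM;

    // The resolver library sends the query to the local DNS server;
    // all the recursion described above happens on the server side.
    if ( getaddrinfo( "www.qq.com", NULL, &hints, &res ) != 0 )
        return 1;

    for ( p = res; p != NULL; p = p->ai_next )
    {
        char ip[ INET_ADDRSTRLEN ];
        struct sockaddr_in *addr = (struct sockaddr_in *) p->ai_addr;
        inet_ntop( AF_INET, &addr->sin_addr, ip, sizeof( ip ) );
        printf( "%s\n", ip );
    }

    freeaddrinfo( res );
    return 0;
}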

附录:
本地DNS配置转发与未配置转发数据包分析
新建一DNS,具体怎么建我这里就不再描述了,见我的上一篇博文《在Win2003中安装bind【部署智能DNS】》
1、DNS服务器不设转发
在192.168.145.228服务器上安装上wireshark软件,并打开它,设置数据包为UDP过滤,在192.168.145.12客户机上用nslookup命令查询一下www.sohu.com,马上可以看到本地DNS服务器直接查全球13台根域中的某几台,然后一步步解析,通过迭代的方式,直到找到www.sohu.com对应的IP为220.181.118.87。
本地DNS服务器得到www.sohu.com的IP后,它把这个IP返回给192.168.145.12客户机,完成解析。

2、DNS服务器设置转发

因www.sohu.com域名在第一步的验证中使用过,有缓存,为了不受上步实验干扰,我们在客户机上192.168.145.12上nslookup www.baidu.com。从图上看,本地DNS把请求转发至192.168.133.10服务器,133.10服务器把得到的IP返回给本地DNS,然后本地DNS再把IP告诉DNS客户机,完成解析。

 

zz from: http://369369.blog.51cto.com/319630/812889

先收集:

http://www.cnblogs.com/panfeng412/archive/2011/10/28/realtime-computing-of-big-data.html

2013-09-11

1.1 敏捷联盟

敏捷这个词被滥用了很久,部分人认为站会、自动部署、没计划、没文档就是敏捷了。其实这都是片面且偏激的看法。从敏捷宣言的解读就可以看出,在强调个体能力和沟通的同时,它也不忽视工具。而在强调代码的同时,也不忽视文档,只是不要求“面面俱到”的文档,而是维护系统原理和架构文档,细节留给代码。而响应变化胜过遵循计划方面,我理解,在互联网行业里,从RD的角度来看,由于来自PM或用户的需求是多变的,所以无法生造出一个长远(横跨几个月)且不变的计划;但从PM或项目经理的角度,得有谱,得规划出产品的长远意义和走向,否则项目将失去灵魂。

1.2  原则

对于我们来说,与其规划一个耗时数月的产品,不如拆分为小功能,花上一两周作出第一版(需要有核心的产品价值和可快速迭代的架构),快速上线(不要局限于一个入口,尤其是需要有我们完全可控的入口),然后收集用户反馈、分析用户行为,持续升级和推广。这里不局限于一个入口,是有感于之前创业阶段,把宝都压在淘宝APP上,完全受制于对方的政策。可快速迭代的架构,我理解必须是内聚和解耦且拥有全面自动化回归case的,这样才可以放心的对某一些子功能动手术。所以,这里强调的是迭代规划、架构、人、数据监控与分析、推广

敏捷中人的作用尤其重要!对程序员的要求有:

  • 视软件如己出,为自己做事,且相信产品会为用户、世界带来极大的价值(后者可能对于我这样的人比较有意义)。只有这样,才有持续的动力 提高自己的技能、避免坏味并写出高质量的代码、主动发现有问题的点并修复它、自驱动领取力所能及的task等等。
  • 言出必行,由于强调的是面对面的交谈,文档、邮件仅作为备忘录记录大事件,所以对于细节更多是通过人的自律来保证的(在敏捷初始,可能还是需要通过细致的TODO list来保证吧?在涉及团队间交互时,接口文档还是必不可少的。)其实,个人觉得,这是不分行业的,是基本的要求,就是按时按质完成工作。

进度的评估,以可用功能完成进度为准,不包含调研、设计、文档、基础lib库的进展。因为后者都太虚无,PM或用户看不到真正的效果,也无法准确的验收进度。这也从另一方面,强迫敏捷开发团队将需求拆分为可独立上线的子功能,否则进度一直都是0%!

最后,每隔一段时间,敏捷团队需要坐下来回顾实施过程中的经验和困难,并作出调整。这一方面是积累,另一方面也可以看出敏捷的原则不是定死的,而是可以根据团队的情况,灵活应用。

 

2013-9-12

2.1.3 短交付周期

极限编程里有“发布计划”、“迭代计划”两个概念。前者是多个完整的story,进行一次发布或上线,持续3个月。后者虽然也由一个或多个story组成,但仅完成开发、测试并持续集成至版本仓库,不发布,持续2周。这种发布的频率可能是产生自传统软件行业,以通信行业为例,一个完整的系统交付可能持续数年甚至更长,每3个月做一个版本升级已经很快。但个人感觉虽然频率并不适合互联网,但思想仍可以借鉴。

3个月的发布计划,会迫使需求提出者作出较为长远的规划,避免需求无目的的堆积。而将多个有组织的需求,再拆分为较小的开发周期,并在此周期内保持需求的稳定,可以使开发者不至于因为需求的频繁变动而乱了节奏。但只要需求未被纳入开发迭代,提出者就可以对需求进行调整,也保持了快速应变的能力。

应用到互联网行业,我们是否仍然是做3个月的规划,2个周的迭代。改变的是迭代结束就上线,将效果交给最终的用户去评判,并加入监控和分析,对于影响较大的紧急反馈在RB分支里在几天内立刻修复并上线,而其他反馈并入随后的迭代里。

2.1.12 简单的设计、2.1.13 重构

简单的设计、不提前设计,这个在现实中如何平衡?而这两点,其实也是基于“重构”来的。简单的设计,在面对新需求,需要变更代码、lib库甚至架构时,如何处之?答案是有全面自动化case保证的重构和高质量的工程师!但现实中,100%全面的case是不存在的,人的方面更是变化多端。所以,个人觉得,需要折中。

  • 欢迎重构,但尽量把重构放在本迭代中,不把坏味代码留到发布版本里,减少后面为了消除代码坏味而进行的重构。
  • 通过数据推测性能需求,通过对需求提出者的诘问推测功能需求,为可预测的将来做准备。
  • 在需要对已有功能做重构时,化整为零,每重构完一个小功能就build,确保无误;并且尽量采取小流量的方式先试点再推广。

2.1.14 隐喻

TODO

背景

原理

https端口proxy

crt chain 配置