Archive for March, 2015




select / poll / epoll: practical difference for system architects


When designing a high performance networking application with non-blocking socket I/O, the architect needs to decide which polling method to use to monitor the events generated by those sockets. There are several such methods, and the use cases for each of them are different. Choosing the correct method may be critical to satisfy the application needs.


This article highlights the differences among the polling methods and provides suggestions on what to use.




  • 1 Polling with select()
  • 2 Polling with poll()
  • 3 Polling with epoll()
  • 4 Polling with libevent

Polling with select()


The old, trusted workhorse from the times when sockets were still called Berkeley sockets. It didn’t make it into the first specification though, since there was no concept of non-blocking I/O at that moment, but it did make it in around the eighties, and nothing in its interface has changed since.


To use select, the developer needs to initialize and fill up several fd_set structures with the descriptors and the events to monitor, and then call select(). A typical workflow looks like this:


fd_set fd_in, fd_out;
struct timeval tv;

// Reset the sets
FD_ZERO( &fd_in );
FD_ZERO( &fd_out );

// Monitor sock1 for input events
FD_SET( sock1, &fd_in );

// Monitor sock2 for output events
FD_SET( sock2, &fd_out );

// Find out which socket has the largest numeric value as select requires it
int largest_sock = sock1 > sock2 ? sock1 : sock2;

// Wait up to 10 seconds
tv.tv_sec = 10;
tv.tv_usec = 0;

// Call the select; note the first argument must be the largest descriptor plus one
int ret = select( largest_sock + 1, &fd_in, &fd_out, NULL, &tv );

// Check if select actually succeeded
if ( ret == -1 )
    // report error and abort
else if ( ret == 0 )
    // timeout; no event detected
else
{
    if ( FD_ISSET( sock1, &fd_in ) )
        // input event on sock1

    if ( FD_ISSET( sock2, &fd_out ) )
        // output event on sock2
}

When the select interface was designed and developed, nobody probably expected there would be multi-threaded applications serving many thousands connections. Hence select carries quite a few design flaws which make it undesirable as a polling mechanism in the modern networking application. The major disadvantages include:


  • select modifies the passed fd_sets so none of them can be reused. Even if you don’t need to change anything – such as if one of the descriptors received data and needs to receive more data – a whole set has to be either recreated again (argh!) or restored from a backup copy via FD_COPY. And this has to be done each time select is called.
  • To find out which descriptors raised the events you have to manually iterate through all the descriptors in the set and call FD_ISSET on each one of them. When you have 2,000 of those descriptors and only one of them is active – and, likely, the last one – you’re wasting CPU cycles each time you wait.
  • Did I just mention 2,000 descriptors? Well, select cannot support that many. At least on Linux. The maximum number of the supported descriptors is defined by the FD_SETSIZE constant, which Linux happily defines as 1024. And while some operating systems allow you to hack this restriction by redefining FD_SETSIZE before including sys/select.h, this is not portable. Indeed, Linux would just ignore this hack and the limit will stay the same.
  • You cannot modify the descriptor set from a different thread while waiting. Suppose a thread is executing the code above. Now suppose you have a housekeeping thread which decided that sock1 has been waiting too long for the input data, and it is time to cut the cord. Since this socket could be reused to serve another paying client, the housekeeping thread wants to close the socket. However the socket is in the fd_set which select is waiting for.
    Now what happens when this socket is closed? man select has the answer, and you won’t like it. The answer is, “If a file descriptor being monitored by select() is closed in another thread, the result is unspecified”.
  • Same problem arises if another thread suddenly decides to send something via sock1. It is not possible to start monitoring the socket for the output event until select returns.
  • The choice of the events to wait for is limited; for example, to detect whether the remote socket is closed you have to a) monitor it for input and b) actually attempt to read the data from the socket to detect the closure (read will return 0). Which is fine if you want to read from this socket, but what if you’re sending a file and do not care about any input right now?
  • select puts extra burden on you when filling up the descriptor list to calculate the largest descriptor number and provide it as a function parameter.

Of course the operating system developers recognized those drawbacks and addressed most of them when designing the poll method. Therefore you may ask, is there any reason to use select at all? Why not just store it on a shelf of the Computer Science Museum? Then you may be pleased to know that yes, there are two reasons, which may be either very important to you or not important at all.


The first reason is portability. select has been around for ages, and you can be sure that every single platform around which has network support and non-blocking sockets will have a working select implementation, while it might not have poll at all. And unfortunately I’m not talking about the tubes and ENIAC here; poll is only available on Windows Vista and above, which excludes Windows XP – still used by a whopping 34% of users as of Sep 2013 despite the Microsoft pressure. Another option would be to still use poll on those platforms and emulate it with select on those which do not have it; it is up to you whether you consider it a reasonable investment.


The second reason is more exotic, and is related to the fact that select can – theoretically – handle timeouts with microsecond precision (and its sibling pselect with nanosecond precision), while both poll and epoll can only handle one-millisecond precision. This is not likely to be a concern on a desktop or server system, whose clocks don’t even run with such precision, but it may be necessary on a realtime embedded platform while interacting with some hardware components. Such as lowering control rods to shut down a nuclear reactor – in this case, please, use select to make sure we all stay safe!


The case above would probably be the only case where you would have to use select and could not use anything else. However if you are writing an application which would never have to handle more than a handful of sockets (like, 200), the difference between using poll and select would not be based on performance, but more on personal preference or other factors.


Polling with poll()

poll is a newer polling method which probably was created immediately after someone actually tried to write the high performance networking server. It is much better designed and doesn’t suffer from most of the problems which select has. In the vast majority of cases you would be choosing between poll and epoll/libevent.


To use poll, the developer needs to initialize an array of struct pollfd structures with the descriptors and events to monitor, and call poll(). A typical workflow looks like this:


// The structure for two events
struct pollfd fds[2];

// Monitor sock1 for input
fds[0].fd = sock1;
fds[0].events = POLLIN;

// Monitor sock2 for output
fds[1].fd = sock2;
fds[1].events = POLLOUT;

// Wait 10 seconds
int ret = poll( fds, 2, 10000 );

// Check if poll actually succeeded
if ( ret == -1 )
    // report error and abort
else if ( ret == 0 )
    // timeout; no event detected
else
{
    // If we detect the event, zero it out so we can reuse the structure
    if ( fds[0].revents & POLLIN )
    {
        fds[0].revents = 0;
        // input event on sock1
    }

    if ( fds[1].revents & POLLOUT )
    {
        fds[1].revents = 0;
        // output event on sock2
    }
}

poll was mainly created to fix the pending problems select had, so it has the following advantages over it:


  • There is no hard limit on the number of descriptors poll can monitor, so the limit of 1024 does not apply here.
  • It does not modify the data passed in the struct pollfd structures. Therefore they can be reused between the poll() calls, as long as you set the revents member to zero for the descriptors which generated events. The IEEE specification states that “In each pollfd structure, poll() shall clear the revents member, except that where the application requested a report on a condition by setting one of the bits of events listed above, poll() shall set the corresponding bit in revents if the requested condition is true“. However in my experience at least one platform did not follow this recommendation, and man 2 poll on Linux does not make such a guarantee either (man 3p poll does though).
  • It allows more fine-grained control of events compared to select. For example, it can detect a remote peer shutdown without monitoring for read events.

There are a few disadvantages as well, which were mentioned above at the end of the select section. Notably, poll is not present on Microsoft Windows older than Vista; on Vista and above it is called WSAPoll although the prototype is the same, and a wrapper could be defined as simply as:


#if defined (WIN32)
static inline int poll( struct pollfd *pfd, int nfds, int timeout ) { return WSAPoll( pfd, nfds, timeout ); }
#endif

And, as mentioned above, the poll timeout has 1 ms precision, which again is very unlikely to be a concern in most scenarios. Nevertheless poll still has a few issues which need to be kept in mind:


  • Like select, it is still not possible to find out which descriptors have the events triggered without iterating through the whole list and checking the revents. Worse, the same happens in the kernel space as well, as the kernel has to iterate through the list of file descriptors to find out which sockets are monitored, and iterate through the whole list again to set up the events.
  • Like select, it is not possible to dynamically modify the set or close the socket which is being polled (see above).

Keep in mind, though, that those issues are rarely important when implementing most client networking applications – the only exception would be P2P client software which may handle thousands of connections. They may not be important even for some server applications. Therefore poll should be your default choice over select unless you have the specific reasons mentioned above. Moreover, poll should be your preferred method even over epoll if any of the following is true:


  • You need to support more than just Linux, and do not want to use epoll wrappers such as libevent (epoll is Linux only);
  • Your application needs to monitor less than 1000 sockets at a time (you are not likely to see any benefits from using epoll);
  • Your application needs to monitor more than 1000 sockets at a time, but the connections are very short-lived (this is a close case, but most likely in this scenario you are not likely to see any benefits from using epoll because the speedup in event waiting would be wasted on adding those new descriptors into the set – see below)
  • Your application is not designed the way that it changes the events while another thread is waiting for them (i.e. you’re not porting an app using kqueue or IO Completion Ports).

Polling with epoll()

epoll is the latest, greatest, newest polling method in Linux (and only Linux). Well, it was actually added to the kernel in 2002, so it is not so new. It differs from both poll and select in that it keeps the information about the currently monitored descriptors and associated events inside the kernel, and exports an API to add/remove/modify those.


To use epoll, much more preparation is needed. A developer needs to:


  • Create the epoll descriptor by calling epoll_create;
  • Initialize the struct epoll_event structure with the wanted events and the context data pointer. The context could be anything; epoll passes this value directly with the returned events structure. We store there a pointer to our Connection class.
  • Call epoll_ctl( … EPOLL_CTL_ADD ) to add the descriptor into the monitoring set
  • Call epoll_wait() to wait for up to 20 events for which we reserve the storage space. Unlike the previous methods, this call receives an empty array, and fills it up only with the triggered events. For example, if there are 200 descriptors and 5 of them have events pending, epoll_wait will return 5, and only the first five members of the pevents array will be initialized. If 50 descriptors have events pending, the first 20 would be copied and 30 would be left in the queue; they won’t get lost.
  • Iterate through the returned items. This will be a short iteration since the only events returned are those which are triggered.

A typical workflow looks like this:


// Create the epoll descriptor. Only one is needed per app, and is used to monitor all sockets.
// The function argument is ignored (it was not before, but now it is), so put your favorite number here
int pollingfd = epoll_create( 0xCAFE );

if ( pollingfd < 0 )
    // report error

// Initialize the epoll structure in case more members are added in future
struct epoll_event ev = { 0 };

// Associate the connection class instance with the event. You can associate anything
// you want, epoll does not use this information. We store a connection class pointer
ev.data.ptr = pConnection1;

// Monitor for input, and do not automatically rearm the descriptor after the event
ev.events = EPOLLIN | EPOLLONESHOT;

// Add the descriptor into the monitoring list. We can do it even if another thread is
// waiting in epoll_wait - the descriptor will be properly added
if ( epoll_ctl( pollingfd, EPOLL_CTL_ADD, pConnection1->getSocket(), &ev ) != 0 )
    // report error

// The storage for up to 20 events (assuming we have added maybe 200 sockets before, that may well happen)
struct epoll_event pevents[ 20 ];

// Wait for up to 10 seconds
int ready = epoll_wait( pollingfd, pevents, 20, 10000 );

// Check if epoll actually succeeded
if ( ready == -1 )
    // report error and abort
else if ( ready == 0 )
    // timeout; no event detected
else
{
    // Check which events were detected
    for ( int i = 0; i < ready; i++ )
    {
        if ( pevents[i].events & EPOLLIN )
        {
            // Get back our connection pointer
            Connection * c = (Connection*) pevents[i].data.ptr;
        }
    }
}

Appendix: the explanation of EPOLLONESHOT, quoted from man epoll_ctl:


Sets the One-Shot behaviour for the associated file descriptor. It means that after an event is pulled out with epoll_wait(2) the associated file descriptor is internally disabled and no other events will be reported by the epoll interface. The user must call epoll_ctl(2) with EPOLL_CTL_MOD to re-enable the file descriptor with a new event mask.

Just looking at the implementation alone should give you a hint of the disadvantages of epoll, which we will mention first. It is more complex to use, requires you to write more code, and requires more library calls compared to the other polling methods.


However epoll has some significant advantages over select/poll both in terms of performance and functionality:


  • epoll returns only the list of descriptors which triggered the events. No need to iterate through 10,000 descriptors to find that one which triggered the event!
  • You can attach meaningful context to the monitored event instead of socket file descriptors. In our example we attached the class pointers which could be called directly, saving you another lookup.
  • You can add sockets or remove them from monitoring anytime, even if another thread is in the epoll_wait function. You can even modify the descriptor events. Everything will work properly, and this behavior is supported and documented. This gives you much more flexibility in implementation.
  • Since the kernel knows all the monitoring descriptors, it can register the events happening on them even when nobody is calling epoll_wait. This allows implementing interesting features such as edge triggering, which will be described in a separate article.
  • It is possible to have multiple threads waiting on the same epoll queue with epoll_wait(), something you cannot do with select/poll. In fact it is not only possible with epoll, but the recommended method in the edge triggering mode.

However you need to keep in mind that epoll is not a “better poll”, and it also has disadvantages when compared to poll:


  • Changing the event flags (i.e. from READ to WRITE) requires the epoll_ctl syscall, while when using poll this is a simple bitmask operation done entirely in userspace. Switching 5,000 sockets from reading to writing with epoll would require 5,000 syscalls and hence context switches (as of 2014 calls to epoll_ctl still could not be batched, and each descriptor must be changed separately), while in poll it would require a single loop over the pollfd structure.
  • Each accept()ed socket needs to be added to the set, and same as above, with epoll it has to be done by calling epoll_ctl – which means there are two required syscalls per new connection socket instead of one for poll. If your server has many connections which do not send much information, epoll will likely take longer than poll to serve them.
  • epoll is exclusively Linux domain, and while other platforms have similar mechanisms, they are not exactly the same – edge triggering, for example, is pretty unique (FreeBSD’s kqueue supports it too though).
  • High performance processing logic is more complex and hence more difficult to debug, especially for edge triggering, which is prone to deadlocks if you miss an extra read/write.

Therefore you should only use epoll if all of the following are true:


  • Your application runs a thread pool which handles many network connections by a handful of threads. You would lose many benefits in a single-threaded application, and most likely it won’t outperform poll.
  • You expect to have a reasonably large number of sockets to monitor (at least 1,000); with a smaller number epoll is not likely to have any performance benefits over poll and may actually have worse performance;
  • Your connections are relatively long-lived; as stated above, epoll will be slower than poll in a situation where a new connection sends a few bytes of data and immediately disconnects, because of the extra system call required to add the descriptor into the epoll set;
  • Your app depends on other Linux-specific features (so in case portability question would suddenly pop up, epoll wouldn’t be the only roadblock), or you can provide wrappers for other supported systems. In the last case you should strongly consider libevent.

If all the items above aren’t true, you will be better served by using poll instead.


Polling with libevent

libevent is a library which basically wraps the polling methods listed in this article (and some others) in a uniform API. Its main advantage is that it allows you to write the code once and compile and run it on many operating systems without the need to change the code. It is important to understand that libevent is just a wrapper built on top of the existing polling methods, and therefore it inherits the issues those polling methods have. It will not make select support more than 1024 sockets on Linux or allow epoll to modify the polling events without a syscall/context switch. Therefore it is still important to understand each method’s pros and cons.


Having to provide access to the functionality of dramatically different methods, libevent has a rather complex API which is much more difficult to use than poll or even epoll. It is however easier to use libevent than to write two separate backends if you need to support both Linux and FreeBSD (epoll and kqueue). Hence it is a viable alternative which should be considered if:


  • Your application requirements indicate that you must use epoll, and using just poll would not be enough (if poll would satisfy your needs, it is extremely unlikely libevent would offer you any benefits);
  • You need to support an OS other than Linux, or may expect such a need to arise in the future. Again, this depends on other features of your application – if it is tied to many other Linux-specific things, you’re not going to achieve anything by using libevent instead of epoll.


