Archive for March, 2015

Let's use the Python Imaging Library (PIL) as an installation example, under two constraints:

  • no write permission on the directory holding the python libs
  • the server's jpeg .so is not in the default location, but under /usr/lib64

Installing the default way fails with: IOError: decoder jpeg not available, because setup.py never linked against the image libraries PIL depends on.

1. Download the source package and extract it

At http://www.pythonware.com/products/pil/ pick your version; I chose Python Imaging Library 1.1.7 Source Kit (all platforms) (November 15, 2009).

wget "http://effbot.org/downloads/Imaging-1.1.7.tar.gz"; tar -xzvf Imaging-1.1.7.tar.gz; cd Imaging-1.1.7

2. Edit setup.py so it can find libjpeg.so

# Use None to look for the libraries in well-known library locations.
# Use a string to specify a single directory, for both the library and
# the include files. Use a tuple to specify separate directories:
# (libpath, includepath).

JPEG_ROOT = ("/usr/lib64", "/usr/include")

3. Install to a chosen path (any path will do, as long as you have read/write permission there)

python setup.py install --home /path/to/your/pythonlib/

Watch the first part of its output:

*** TKINTER support not available
— JPEG support available
— ZLIB (PNG/ZIP) support available
— FREETYPE2 support available
*** LITTLECMS support not available

4. Add the pythonlib path to PYTHONPATH, making sure to go all the way down into lib/python

$ tail -1 ~/.bash_profile

export PYTHONPATH=/path/to/your/pythonlib/lib/python/:$PYTHONPATH
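To verify the rebuilt PIL actually picked up libjpeg, a quick hedged sanity check (the image path is hypothetical):

import Image                         # PIL 1.1.7's top-level module
im = Image.open('/path/to/test.jpg') # any JPEG you have read access to
im.load()                            # forces the jpeg decoder to run; raises IOError if it is missing
print im.format, im.size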

Logs for completed jobs

Our company cluster can list completed jobs, but under the default configuration, clicking into one shows:

No event logs were found for this application! To enable event logging, set spark.eventLog.enabled to true and spark.eventLog.dir to the directory to which your event logs are written.

After setting spark.eventLog.enabled to true in the Spark UI configuration and spark.eventLog.dir to a local file path, clicking in shows instead:

No event logs found for application chengyi02_test_load_1GB in file:/home/users/chengyi02/spark-resource/test_perf/log//chengyi02_test_load_1gb-1427419993180. Did you specify the correct logging directory?

Naturally, the history server cannot read my local path. Still, the directory now contains logs, which look like JSON.

Changed spark.eventLog.dir again, to an HDFS path; now the Spark job itself errored out. After talking to the admins, it turned out the history server for this Spark cluster was simply not running:

15/03/27 09:50:32 WARN ServletHandler: /stages/

java.lang.NullPointerException

        at org.apache.spark.SparkContext.getAllPools(SparkContext.scala:892)

        at org.apache.spark.ui.jobs.JobProgressPage.render(JobProgressPage.scala:50)

        at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)

        at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)

        at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:70)

        at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)

        at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)

        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)

        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)

        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)

        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)

        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)

        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)

        ……
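For reference, a minimal sketch of the two settings named in the error message, set from the application side (the HDFS directory is a placeholder and must be readable by the history server):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set('spark.eventLog.enabled', 'true')
        .set('spark.eventLog.dir', 'hdfs:///path/readable/by/history-server'))  # placeholder path
sc = SparkContext(conf=conf)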

"too many values to unpack" caused by partitionBy

http://stackoverflow.com/questions/7053551/python-valueerror-too-many-values-to-unpack
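The thread above matches what happens here: partitionBy (like other pair-RDD operations) expects an RDD of 2-tuples and unpacks each element into (k, v), so any element with more than two fields triggers the error. A minimal reproduction and fix (data is made up):

rdd = sc.parallelize([('k1', 1, 2), ('k2', 3, 4)])  # 3 fields per element
# rdd.partitionBy(4)                                # fails: ValueError: too many values to unpack

pairs = rdd.map(lambda t: (t[0], t[1:]))            # reshape into (key, value) 2-tuples
pairs.partitionBy(4)                                # now fine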

Tuning HDFS file read/write performance

Test 1: 60GB of data in 1108 files, ranging from 20MB to 100MB+. I set max cores to 160; 1108 tasks launched, apparently on only 5 servers. Reading took 1.3min. I ran the write twice: once 1.75min, once 8min (hung on one task).

Test 2: max cores raised to 1000, with spark.speculation set to true so slow tasks get re-launched. Read time improved to 29s and write time to 41s, a clear win! The file count was unchanged, so still 1108 tasks, but spread over 32 servers.

Test 3: max cores 1000, spark.speculation true. File sizes normalized to about 165MB, 359 files, 60GB of data. Read 40s, write 55s. Compared with test 1, the overall time still came down even though the task count dropped.

Test 4: max cores back to 160, spark.speculation unset. Same 165MB files, 359 of them, 60GB. Read 51s, write 442s (hung on 2 tasks). So setting spark.speculation really matters!
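For reference, a sketch of the settings used in tests 2-4, with the standard Spark 1.x property names (spark.cores.max is the "max cores" above):

conf = (SparkConf()
        .set('spark.cores.max', '1000')      # "max cores"
        .set('spark.speculation', 'true'))   # re-launch slow tasks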

Test 5: 3.TB of data, 15846 files from 27MB to 260MB, max cores set to 10000. It started on 32 servers with 32 concurrent tasks each, for overall parallelism of 1664, later growing to 50+ servers. Then Spark ran itself into full GC.

java.io.IOException: org.apache.hadoop.hdfs.FMSClient$RetryableIOException: Could not obtain block: blk_31525197851421116_494327281

(The accompanying chart is omitted.)

Test 6: 3.TB of data, 15846 files from 27MB to 260MB, max cores 10000, but now on Spark 1.2. Still failing:

Error type 1:

15/03/27 14:56:56 ERROR DAGScheduler: Failed to update accumulators for ResultTask(0, 4815)

java.net.SocketException: Broken pipe

Error type 2:

java.io.IOException: org.apache.hadoop.hdfs.FMSClient$RetryableIOException: Could not obtain block: blk_31525197851421116_494327281

at org.apache.hadoop.hdfs.FMSClient$DFSInputStream.read(FMSClient.java:2563)

……

The failures in tests 5 and 6 may both have been caused by bad disk sectors on the HDFS servers.

Test 7: Spark 1.2. 1.3TB of data in just 114 files, from a few hundred bytes up to 57GB, mostly large files. max cores 10000, spark.speculation true. 5353 tasks launched; reading took 5.1min!

Test 8: same as test 7, but on Spark 1.1. Reading took 3.6min.

Reading an HDFS path without permission

TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

Fix: set the UGI entry in spark/conf/hadoop-site.xml on the application server to the correct username and password.

broadcast an RDD in PySpark

>>> data = sc.textFile('/app/ecom/aries/sf_public/chengyi02/uclogin/20141120/cq02-dr-uclogin206.cq02/*')

>>> sc.broadcast(data)

This raises: py4j.Py4JException: Method __getnewargs__([]) does not exist

Changing it to sc.broadcast(data.collectAsMap()) avoids the error.

Explanation: you cannot broadcast an RDD itself; what gets broadcast must be a plain local value on the driver. The collectAsMap here is not really about collecting computation results; it just pulls that textFile RDD down to the driver as key/value pairs, and that local dict is what actually gets shipped out.
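A sketch of the working pattern, assuming tab-separated key/value lines (paths and field layout are hypothetical):

pairs = data.map(lambda line: tuple(line.split('\t', 1)))  # reshape lines into (k, v) 2-tuples
b = sc.broadcast(pairs.collectAsMap())                     # pull to the driver as a dict, then broadcast
hits = pairs.filter(lambda (k, v): k in b.value).count()   # read it on executors via .value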

Reading an HDFS SequenceFile

Job aborted due to stage failure: Task 8 in stage 0.0 failed 4 times, most recent failure: Lost task 8.3 in stage 0.0 (TID 47, nmg01-spark-a0033.nmg01.baidu.com): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 64
        com.esotericsoftware.kryo.io.Output.require(Output.java:138)
        com.esotericsoftware.kryo.io.Output.writeString_slow(Output.java:420)
        com.esotericsoftware.kryo.io.Output.writeString(Output.java:326)
        com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:274)
        com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:262)
        com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
        org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:156)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
        java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        java.lang.Thread.run(Thread.java:662)
Driver stacktrace:
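This Kryo Buffer overflow means a record did not fit in the serializer's output buffer. A common mitigation is to raise the Kryo buffer ceiling; a sketch using the Spark 1.x property names (the 512MB value is illustrative, the pre-1.4 default was 64MB):

conf = (SparkConf()
        .set('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
        .set('spark.kryoserializer.buffer.max.mb', '512'))  # renamed spark.kryoserializer.buffer.max in Spark 1.4+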

Job shut down abnormally

SparkDeploySchedulerBackend: Asked to remove non-existent executor

Delete /tmp/spark-* and /tmp/fetchFileTemp*.

Job fails to launch

Launching a job with spark-submit failed; the bottom of the output reads:

363 15/03/30 09:51:17 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
364 15/03/30 09:51:17 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: Master removed our application: FAILED

The cause: the job's task count was set too high, 10000, while the cluster showed Cores: 2272 Total, 472 Used. If max cores > 2272, the application simply gets killed! If it only exceeds the number of unused cores, it first grabs whatever is available and the remaining tasks queue up.

Further up, the log reads:

353 15/03/30 09:51:17 INFO AppClient$ClientActor: Executor updated: app-20150330095112-25033/62 is now EXITED (Command exited with code 1)

Using the task number, locate the corresponding server in the log:

345 15/03/30 09:51:16 INFO AppClient$ClientActor: Executor added: app-20150330095112-25033/62 on worker-20141125192713-nmg01-spark-a0062.nmg01.xxx.com-11543 (nmg01-spark-a0062.nmg01.xxx.com:11543) with 32 cores

On the Spark cluster's monitoring page, find that worker. Its Finished Executors list has plenty of failed tasks; find mine and check its stdout:

Error occurred during initialization of VM
Could not reserve enough space for object heap

So this was likely OOM on some nodes; in most cases Spark can work around it by re-running or retrying.

Out of Memory

Calling takeSample on a large list and collecting the sample back failed as follows; the two nodes highlighted (in the original output) were a worker and the server hosting the application.

15/03/30 17:35:15 WARN DefaultChannelPipeline: An exception was thrown by a user handler while handling an exception event ([id: 0x22dfe421, /10.75.65.12:12237 => /10.48.23.31:30001] EXCEPTION: java.lang.OutOfMemoryError: Java heap space)

java.lang.OutOfMemoryError: Java heap space

at java.lang.Object.clone(Native Method)

at akka.util.CompactByteString$.apply(ByteString.scala:410)

at akka.util.ByteString$.apply(ByteString.scala:22)

at akka.remote.transport.netty.TcpHandlers$class.onMessage(TcpSupport.scala:45)

at akka.remote.transport.netty.TcpServerHandler.onMessage(TcpSupport.scala:57)
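Since takeSample is an action, its result lands on the driver, so both executors and the driver can blow the heap. A hedged mitigation sketch (the numbers are illustrative):

# cap the sample at an absolute count that fits in the driver's heap;
# in Spark 1.x the second argument of takeSample is a count, not a fraction
sample = big_rdd.takeSample(False, 100000, seed=42)

# and/or give the driver more headroom at submit time:
#   spark-submit --driver-memory 4g my_app.py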

 

Broadcasting large data

The file is about 1GB; after textFile().collect(), it occupies xxxx of memory.

The configuration was:

 

Problem 1:

The spark-submit job exits with no error message; the history UI list shows it as finished, but clicking in reveals still-active jobs.

Go to the executor UI list and find this app's stderr, which shows:

15/04/07 13:07:47 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@nmg01-taihang-d10538.nmg01.baidu.com:54765] -> [akka.tcp://sparkDriver@cq01-rdqa-dev006.cq01.baidu.com:16313] disassociated! Shutting down.
15/04/07 13:07:47 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@cq01-rdqa-dev006.cq01.baidu.com:16313] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

Following http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-3106-fixed-td16264.html#a16316, setting cf.set('spark.akka.heartbeat.interval', 10000) had no effect.

Issues on the local driver machine:

$ dmesg

lowest memcgroup 19

[chengyi02@cq01-rdqa-dev006.cq01.baidu.com test_perf]$ strace -p 19970
Process 19970 attached - interrupt to quit
futex(0x40e4e9f0, FUTEX_WAIT, 19974, NULL

Running Python unittests

From spark/python/run-tests you can see the pattern:

SPARK_TESTING=1  /path/to/spark-*-client/bin/pyspark tests/schedule_frame_test.py

 

Installing SWIG

On a Mac it's trivial: brew install swig. On other platforms, download it from http://www.swig.org/; installation shouldn't be hard either.

Calling C functions from Python

1. Prepare a simple C file, palindrome.c:

#include <string.h>

/**
 * return: 0 - not palindrome
 *         1 - is palindrome
 */
int is_palindrome(char* text)
{
    int i, n = strlen(text);

    for (i = 0; i <= n/2; i++) {
        if (text[i] != text[n-i-1]) {
            return 0;
        }
    }

    return 1;
}

2. Then, as SWIG requires, prepare an interface file (similar to a .h), palindrome.i:

%module palindrome

%{
#include <string.h>
%}

extern int is_palindrome(char* text);

3. Run SWIG to generate the wrapper; this produces two new files, palindrome.py and palindrome_wrap.c:

$ swig -python palindrome.i

4. Compile into a shared library:

gcc -shared -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -L/System/Library/Frameworks/Python.framework/Versions/2.7/ *.c -lpython2.7 -o _palindrome.so

(If you don't know where Python is installed, check with python -c "import sys; import pprint; pprint.pprint(sys.path)".)
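Alternatively, if you'd rather not hunt for the framework paths at all, distutils can run SWIG and locate the Python headers for you. A sketch of a setup.py under that assumption (run with python setup.py build_ext --inplace):

# setup.py - builds _palindrome.so from the SWIG interface plus the C source
from distutils.core import setup, Extension

setup(name='palindrome',
      ext_modules=[Extension('_palindrome',
                             sources=['palindrome.i', 'palindrome.c'],
                             swig_opts=['-python'])])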

5. Call it from Python, palindrome_test.py:

import _palindrome

print dir(_palindrome)

mystr1 = "nopalindrome"
mystr2 = "ipreferpi"

print _palindrome.is_palindrome(mystr1)
print _palindrome.is_palindrome(mystr2)

 

Execution

1. Interactive:

/path/to/spark/bin/pyspark

2. Batch:

/path/to/spark/bin/spark-submit ~/spark-resource/spark-training/spark/examples/src/main/python/pi.py

 

Basic RDDs

element-wise transformations

  • map
  • filter
  • flatMap

pseudo set operations transformations

These require the RDDs involved to contain elements of the same type.

  • distinct, expensive!
  • union, does not deduplicate
  • intersection, deduplicates, and expensive!
  • subtract, expensive!
  • cartesian, expensive!

actions

  • reduce
  • fold
  • aggregate
  • collect, returns all results to the driver, so the result set must fit in a single server's memory. On a list it returns a list; on a dict it returns a list of keys.
  • take
  • top
  • takeSample
  • foreach, applies func to each element and returns nothing
  • count
  • countByValue
  • takeOrdered
Note: unlike Scala and Java, which have DoubleRDD and friends in addition to the base RDD, Python has only the base RDD. Every RDD method hangs off the base RDD, and if the contained data has the wrong type, things simply blow up at runtime.

persistence

  • persist = cache. The corresponding levels are pyspark.StorageLevel.MEMORY_ONLY_SER and so on
  • unpersist
  • is_cached, an attribute, not a method

Key/Value pairs = pair RDDs

Note: in Python, Spark's pair RDDs are lists of tuples, not dicts, e.g. [ ('k1', 'v1'), ('k2', 'v2') ].

create pair RDDs

  • pairs = lines.map(lambda line: (line.split(" ")[0], line)), built from a text file, keyed on the first word
  • data = sc.parallelize( [ ('k1', 'v1'), ('k2', 'v2') ] ), parallelized from in-memory pairs
  • read directly from a file (TODO)

transformations on pair RDDs

  • All the transformations listed above for base RDDs work on pair RDDs too. Each pair RDD item is a tuple (key, val), so t[0] is the key and t[1] the value.

transformations on one pair RDD

  • reduceByKey
  • groupByKey
  • combineByKey; its createCombiner is called the first time each partition sees a key, so it runs multiple times over the whole dataset (see the per-key average sketch after this list)
  • mapValues
  • flatMapValues
  • keys
  • values
  • sortByKey
  • foldByKey
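The textbook combineByKey use is a per-key average; a small sketch (data made up):

nums = sc.parallelize([('a', 1), ('a', 3), ('b', 5)])
sum_count = nums.combineByKey(
    lambda v: (v, 1),                          # createCombiner: first value of a key in a partition
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue: fold in another value
    lambda a, b: (a[0] + b[0], a[1] + b[1]))   # mergeCombiners: merge across partitions
avg = sum_count.mapValues(lambda (s, c): float(s) / c).collectAsMap()
# {'a': 2.0, 'b': 5.0}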

transformations on two pair RDDs

  • subtractByKey
  • join, inner join, only keys that are present in both pair RDDs are output
  • rightOuterJoin
  • leftOuterJoin 
  • cogroup

actions on pair RDDs

  • The actions listed above for base RDDs also apply to pair RDDs
  • countByKey
  • collectAsMap, works on list( tuple, tuple ) and returns a dict; it does not work on a dict or on a list of non-tuples
  • lookup

tuning joins on pair RDDs

  • partitionBy pins the large dataset to "fixed" servers, so subsequent joins with a small dataset no longer redistribute the large one; Spark ships the small dataset over, partitioned the same way. It is a transformation; its second parameter partitionFunc controls how records map to partitions (sketch after this list).
  • For pair RDDs, when you don't change the key, prefer mapValues and flatMapValues so partitioning is preserved (you could emulate them with map, but Spark doesn't analyze map's func, so it can't maintain the partitioning).
  • Partitioning greatly affects performance; to be expanded after more hands-on experience (TODO)
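A sketch of the partitionBy pattern (sizes made up): fix the big RDD's partitioning once and cache it, so later joins only move the small side:

big = sc.parallelize([(i % 100, i) for i in xrange(100000)])
big = big.partitionBy(8).cache()               # pin the partitioning, keep it resident
small = sc.parallelize([(i, str(i)) for i in xrange(100)])
joined = big.join(small)                       # only small is shuffled to big's partitions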

file I/O

  • Python's saveAsSequenceFile handles a list of tuples, e.g.: sc.parallelize([ ('k1', 'v1'), ('k2', 'v2') ]).saveAsSequenceFile('...'). Given a dict or a list of non-tuples it fails with: RDD element of type java.lang.String cannot be used. Also, if the keyClass/valueClass passed in don't match the data, values are silently treated as strings (round-trip sketch after this list).
Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system
……
  1. A Java RDD is created from the SequenceFile or other InputFormat, and the key and value Writable classes
  2. Serialization is attempted via Pyrolite pickling
  3. If this fails, the fallback is to call 'toString' on each key and value
  4. PickleSerializer is used to deserialize pickled objects on the Python side
  • saveAsNewAPIHadoopFile likewise takes a list of tuples; it can be used for protobuf-formatted data.
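A round-trip sketch for the SequenceFile case (the path is a placeholder):

sc.parallelize([('k1', 'v1'), ('k2', 'v2')]).saveAsSequenceFile('/tmp/seq_demo')
print sc.sequenceFile('/tmp/seq_demo').collect()
# [(u'k1', u'v1'), (u'k2', u'v2')]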

shared variables

  • You can broadcast lists, dicts and other types; read the value with b_var.value
  • Accumulator updates made in actions are applied exactly once even under fault tolerance; updates made in transformations have no such guarantee (sketch below)
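A sketch of the accumulator caveat (path and line format are hypothetical): the update below happens inside map, a transformation, so a retried task may count the same line twice; only updates performed in actions are exactly-once.

bad_lines = sc.accumulator(0)

def parse(line):
    if '\t' not in line:
        bad_lines.add(1)          # updated on executors; only the driver may read .value
        return None
    return tuple(line.split('\t', 1))

parsed = sc.textFile('/path/to/input').map(parse).filter(lambda x: x is not None)
parsed.count()                    # an action forces the updates to happen
print bad_lines.value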

spark transformations:

  • map(func), where func is: mixed func(x)

example: distData.map(lambda x: (x, 1)) uses a Python lambda to build an anonymous function that takes each row as input x and returns (x, 1). In other words, the statement turns every row x into (x, 1).

  • filter(func), where func is: bool func(x)

example: f.filter(lambda x: x.find("spark") != -1) finds the lines in f's RDD that contain the word spark.

  • flatMap(func), where func is: Seq func(x)

example: f.flatMap(lambda x: x.split(" ")).collect() splits each input line on spaces, so one flatMap input line maps to several output lines, hence the Seq return type. Note that each output row above is a single word; with f.map(lambda x: x.split(" ")).collect() instead, each output row would be a seq(word).

  • sample(withReplacement, fraction, seed), taking three parameters of types bool, float and int.

example: f.flatMap(lambda x: x.split(" ")).sample(withReplacement=True, fraction=0.01, seed=181341).collect()

  • union(otherDataSet); apparently otherDataSet must itself be an RDD, not a plain array

example:

new_edges = tc.join(edges).map(lambda (_, (a, b)): (b, a))
tc = tc.union(new_edges).distinct().cache()

  • distinct([numTasks]), deduplicates

example: f.flatMap(lambda x: x.split(" ")).distinct()

  • groupByKey([numTasks]), like SQL's GROUP BY, turns (k, v) => (k, seq(v))
  • reduceByKey(func, [numTasks]), like MapReduce's reduce: func is applied repeatedly over all vals sharing a key

example:

contribs.reduceByKey(add)  # uses Python's add directly as func

f.reduceByKey(lambda x, _: x)  # an anonymous function as func

pointStats = closest.reduceByKey(lambda (x1, y1), (x2, y2): (x1 + x2, y1 + y2))  # an anonymous function as func

  • sortByKey([ascending], [numTasks])
  • join(otherDataSet, [numTasks])
  • cogroup(otherDataSet, [numTasks])
  • cartesian(otherDataSet), somewhat like PHP's array_combine

Spark Actions:

  • reduce(func), where func is: mixed func(mixed first, mixed second)
  • collect()
  • count()
  • first(), equivalent to take(1)
  • take(n); note this is currently not parallelized: the driver program computes it by itself
  • takeSample(withReplacement, num, seed)
  • saveAsTextFile(path); path is a directory, and each element is converted with toString before being written
  • saveAsSequenceFile(path); path is a directory. Only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
  • countByKey()
  • foreach(func)

pyspark package

  • parallelize(c, numSlices=None); numSlices controls how many parallel tasks are launched

example: count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)

Python help

>>> help('operator')

>>> help('operator.add')

 

In our project we need to collect the logs produced by our web servers in real time for analysis. PV is high and each PV produces multiple log lines, so overall system throughput is substantial. Since the web servers handle all first-hand log reads and writes, this article focuses on whether PHP writing logs, and a logagent process reading and pushing them over the network in real time, affects performance.

C functions and their caches

The PHP core is implemented in C, so understanding its log-write performance starts with the relevant C functions. As for how those functions relate to caches: application memory and the C library buffer both live in the current process's user-space memory, while the page cache is kernel memory shared across processes.

A few notes worth attention:

  • Note that fclose only flushes the user space buffers provided by the C library. To ensure that the data is physically  stored  on disk the kernel buffers must be flushed too, e.g. with sync(2) or fsync(2).
  • Note  that  fflush() only flushes the user space buffers provided by the C library.  To ensure that the data is physically stored on disk the kernel buffers must be flushed too, e.g. with sync(2) or fsync(2).
  • fsync  copies  all  in-core parts of a file to disk, and waits until the device reports that all parts are on stable storage.  It also updates metadata stat information. It does not necessarily ensure that the entry in the directory containing  the  file  has also reached disk.  For that an explicit fsync on the file descriptor of the directory is also needed.
  • In case the hard disk has write cache enabled, the data may not really be on permanent storage when fsync/fdatasync return.When an ext2 file system is mounted with the sync option, directory entries are also implicitly synced by fsync. On kernels before 2.4, fsync on big files can be inefficient.  An alternative might be to use the O_SYNC flag to open(2).
  • A successful return from write does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations (note: e.g. certain allocators for larger memory blocks), it does not even guarantee that space has successfully been reserved for the data (note: it may be evicted by LRU). The only way to be sure is to call fsync(2) after you are done writing all your data.
  • Opening a file with O_APPEND set causes each write on the file to be appended to the end. (Note: for a single write, atomicity is guaranteed.)

PHP's use of these C functions

Reading PHP's fwrite source and stepping through with gdb confirms that, for ordinary local files, the call chain does not go through the standard C library IO functions at all: PHP calls the system functions open and write directly. Each call crosses between user mode and kernel mode, which has some cost, and the C library's cache layer is never used.

Note also that although the write system call guarantees atomicity, PHP wraps it, splitting the buffer into multiple write calls according to the stream's chunk_size. So when buf exceeds chunk_size, atomicity can no longer be guaranteed. chunk_size is defined in main/php_network.h: #define PHP_SOCK_CHUNK_SIZE 8192.

Now look at file_put_contents: it issues several system calls, namely open, an optional flock, write, fflush and close. Two places can break atomicity:

  • if a one-dimensional array is passed in, php_stream_write is called once per element
  • each php_stream_write call splits data larger than chunk_size into multiple write() calls

Also, the commonly used error_log function, when the target is an ordinary local file, is just a wrapper around php_stream_open_wrapper, php_stream_write and php_stream_close.

How the log writes are implemented

The basic idea follows my earlier post on wrapping a PHP Log class; three approaches are considered:

  • call file_put_contents directly every time
  • call fwrite directly every time
  • append to an application-level buffer first, then flush in bulk with file_put_contents on a trigger condition or at the end of the request

The full implementation is at: https://github.com/flykobe/php-log

Performance tests

Writing small log entries

Each of the three approaches was called 100k times, 1024B per line, for roughly 1GB written in total. On a Mac laptop, PHP 5.3.24, averaged over several runs:

_fwrite 2647.6ms

_write_with_buf 2659.9ms

_file_put_contents 10322.6ms

So the cost ordering is: _file_put_contents > _write_with_buf > _fwrite.

Writing large log entries

Each approach called 100k times, 8192B per line, for roughly 8GB written in total. Same Mac, PHP 5.3.24, averaged over several runs:

_fwrite 4300ms

_write_with_buf 5575ms

_file_put_contents 11600ms

_fwrite and _write_with_buf both slowed down considerably, and the gap between them widened, mainly because _write_with_buf now takes a lock to preserve atomicity. _file_put_contents, by contrast, changed relatively little.

Optimizations attempted on _write_with_buf:

  • self::$_arrBuf[$strType][] = &$strMsg; // store references in the buffer array to spare the PHP core a data copy. No effect; if anything, slightly slower
  • self::$_intBufMaxNum = $arrConfig['writebufmax']; // pass writebufmax = 100 (the default is 20 lines). Mildly effective, about 200ms faster, though noisy across runs. Raising it further gains a little more, but that is not very practical for a web system.

Atomicity tests

Run two ./atomic_test.php 100000 8196 scripts concurrently, writing 8196 a's and 8196 b's respectively. Verify with perl that a and b never appear on the same line:

perl -ne '$ma = m#1234567\t.*a#; $mb = m#1234567\t.*b#; if ($ma && $mb){print;}' log/20150317/notice.log | wc -l

Plenty of interleaved lines turn up! Yet if the written string is shrunk to 1024, 4096 and the like, i.e. well below chunk_size even after the automatically added log fields, there is no interleaving.

With _write_with_buf, 8196B log entries show no interleaving! Nor at 4096.

With _file_put_contents, 8196B entries do interleave! At 4096B they don't.

In short, once a log entry (padding included) exceeds chunk_size, the PHP layer alone cannot guarantee atomicity; _write_with_buf passes LOCK_EX to file_put_contents at the application level when needed, which is what preserves it.

Durability considerations

By default the write() system call only reaches the kernel page cache and waits to be flushed to disk by the scheduler; a system or hardware failure in that window can lose data. For 100% durability on a single machine you must call fsync() yourself, or set the O_DIRECT flag on open(). Both are expensive, since the write now includes the time to actually reach the disk. A log class rarely needs that guarantee, so we leave it off.

References

http://www.ibm.com/developerworks/cn/linux/l-cn-directio/

http://www.dbabeta.com/2009/io-performence-02_cache-and-raid.html

http://php.net/manual/zh/function.stream-set-write-buffer.php

http://blog.chinaunix.net/uid-27105712-id-3270102.html

The sar command

I like this command a lot for inspecting and monitoring a system's overall state. The following is quoted from http://linux.die.net/man/1/sar, with annotations.

Common option combinations

Memory

free, sar -r 1, sar -B 1, ps aux

CPU and load

sar -u ALL 1, sar -P ALL 1, sar -q 1, sar -w 1

Disk

sar -b 1, sar -d 1, sar -v 1

Network

sar -n DEV etc.

Man page, with annotations

Name

sar – Collect, report, or save system activity information.

Synopsis

sar [ -A ] [ -b ] [ -B ] [ -C ] [ -d ] [ -h ] [ -i interval ] [ -m ] [ -p ] [ -q ] [ -r ] [ -R ] [ -S ] [ -t ] [ -u [ ALL ] ] [ -v ] [ -V ] [ -w ] [ -W ] [ -y ] [ -n { keyword [,…] | ALL } ] [ -I { int[,…] | SUM | ALL | XALL } ] [ -P { cpu [,…] | ALL } ] [ -o [ filename ] | -f [ filename ] ] [ -s [ hh:mm:ss ] ] [ -e [hh:mm:ss ] ] [ interval [ count ] ]

Description

The sar command writes to standard output the contents of selected cumulative activity counters in the operating system. The accounting system, based on the values in the count and interval parameters, writes information the specified number of times spaced at the specified intervals in seconds. If the interval parameter is set to zero, the sar command displays the average statistics for the time since the system was started. If the interval parameter is specified without the count parameter, then reports are generated continuously. The collected data can also be saved in the file specified by the -o filename flag, in addition to being displayed onto the screen. If filename is omitted, sar uses the standard system activity daily data file, the /var/log/sa/sadd file, where the dd parameter indicates the current day. By default all the data available from the kernel are saved in the data file.

The sar command extracts and writes to standard output records previously saved in a file. This file can be either the one specified by the -f flag or, by default, the standard system activity daily data file.

Without the -P flag, the sar command reports system-wide (global among all processors) statistics, which are calculated as averages for values expressed as percentages, and as sums otherwise. If the -P flag is given, the sar command reports activity which relates to the specified processor or processors. If -P ALL is given, the sar command reports statistics for each individual processor and global statistics among all processors. So you can view all CPUs combined (no -P), a specific processor (-P CPU-NUM), or each CPU plus the aggregate (-P ALL). For application work, this shows whether a program makes balanced use of the machine's CPU parallelism.

You can select information about specific system activities using flags. Not specifying any flags selects only CPU activity. Specifying the -A flag is equivalent to specifying -bBdqrRSvwWy -I SUM -I XALL -n ALL -u ALL -P ALL.

The default version of the sar command (CPU utilization report) might be one of the first facilities the user runs to begin system activity investigation, because it monitors major system resources. If CPU utilization is near 100 percent (user + nice + system), the workload sampled is CPU-bound.

If multiple samples and multiple reports are desired, it is convenient to specify an output file for the sar command. Run the sar command as a background process. The syntax for this is:

sar -o datafile interval count >/dev/null 2>&1 &

All data is captured in binary form and saved to a file (datafile). The data can then be selectively displayed with the sar command using the -f option. Set the interval and count parameters to select count records at interval second intervals. If the count parameter is not set, all the records saved in the file will be selected. Collection of data in this manner is useful to characterize system usage over a period of time and determine peak usage hours.

Note: The sar command only reports on local activities.

Options

-A

This is equivalent to specifying -bBdqrRSuvwWy -I SUM -I XALL -n ALL -u ALL -P ALL.

-b

Report I/O and transfer rate statistics. The following values are displayed: I/O monitoring for physical storage devices.

tps

Total number of transfers per second that were issued to physical devices. A transfer is an I/O request to a physical device. Multiple logical requests can be combined into a single I/O request to the device. A transfer is of indeterminate size.

Read/write requests issued to physical storage per second; broken down below into read requests (rtps) and write requests (wtps).

rtps

Total number of read requests per second issued to physical devices.

wtps

Total number of write requests per second issued to physical devices.

bread/s

Total amount of data read from the devices in blocks per second. Blocks are equivalent to sectors with 2.4 kernels and newer and therefore have a size of 512 bytes. With older kernels, a block is of indeterminate size.

Blocks read in per second. On 2.4+ kernels, block size = sector size = 512B. The written-blocks counterpart follows below.

bwrtn/s

Total amount of data written to devices in blocks per second.

-B

Report paging statistics. Some of the metrics below are available only with post 2.5 kernels. The following values are displayed: page-swapping statistics.

pgpgin/s

Total number of kilobytes the system paged in from disk per second. Note: With old kernels (2.2.x) this value is a number of blocks per second (and not kilobytes).

On kernels above 2.2, the amount of data paged in from disk per second, in KB.

pgpgout/s

Total number of kilobytes the system paged out to disk per second. Note: With old kernels (2.2.x) this value is a number of blocks per second (and not kilobytes).

On kernels above 2.2, the amount of data paged out to disk per second, in KB.

fault/s

Number of page faults (major + minor) made by the system per second. This is not a count of page faults that generate I/O, because some page faults can be resolved without I/O.

Page faults per second produced by the system. Note that a page fault does not necessarily cause I/O; only the major faults below always load data from disk into memory.

majflt/s

Number of major faults the system has made per second, those which have required loading a memory page from disk.

Major faults per second, each forcing a memory page to be loaded from disk.

pgfree/s

Number of pages placed on the free list by the system per second.

Memory pages freed by the system per second.

pgscank/s

Number of pages scanned by the kswapd daemon per second.

Memory pages scanned by the kswapd daemon per second.

pgscand/s

Number of pages scanned directly per second.

Memory pages scanned directly per second.

pgsteal/s

Number of pages the system has reclaimed from cache (pagecache and swapcache) per second to satisfy its memory demands.

Pages per second reclaimed from cache (page cache and swap cache) to satisfy the system's memory demands.

%vmeff

Calculated as pgsteal / pgscan, this is a metric of the efficiency of page reclaim. If it is near 100% then almost every page coming off the tail of the inactive list is being reaped. If it gets too low (e.g. less than 30%) then the virtual memory is having some difficulty. This field is displayed as zero if no pages have been scanned during the interval of time.

-C

When reading data from a file, tell sar to display comments that have been inserted by sadc.

-d

Report activity for each block device (kernels 2.4 and newer only). When data is displayed, the device specification dev m-n is generally used (DEV column). m is the major number of the device. With recent kernels (post 2.5), n is the minor number of the device, but is only a sequence number with pre 2.5 kernels. Device names may also be pretty-printed if option -p is used (see below). Values for fields avgqu-sz, await, svctm and %util may be unavailable and displayed as 0.00 with some 2.4 kernels. Note that disk activity depends on sadc options "-S DISK" and "-S XDISK" to be collected. The following values are displayed:

tps

Indicate the number of transfers per second that were issued to the device. Multiple logical requests can be combined into a single I/O request to the device. A transfer is of indeterminate size.

Same meaning as tps under -b.

rd_sec/s

Number of sectors read from the device. The size of a sector is 512 bytes.

Same meaning as bread/s under -b.

wr_sec/s

Number of sectors written to the device. The size of a sector is 512 bytes.

Same meaning as bwrtn/s under -b.

avgrq-sz

The average size (in sectors) of the requests that were issued to the device.

Average amount of data per request, in sectors; sector size = 512B.

avgqu-sz

The average queue length of the requests that were issued to the device.

Average queue length of outstanding requests. With disk or other hardware trouble, the queue length will most likely spike.

await

The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.

Average request time in ms, covering both the time spent queueing and the actual service time.

svctm

The average service time (in milliseconds) for I/O requests that were issued to the device.

Average actual service time per request, in ms. await minus svctm gives the queueing time. On our HBase servers, await - svctm reaches around 3ms, while a lightly loaded web server shows only a few hundredths of a ms.

%util

Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.

Share of CPU time during which I/O requests were issued to the device. The closer to 100%, the more saturated the device.

-e [ hh:mm:ss ]

Set the ending time of the report. The default ending time is 18:00:00. Hours must be given in 24-hour format. This option can be used when data are read from or written to a file (options -f or -o ).

-f [ filename ]

Extract records from filename (created by the -o filename flag). The default value of the filename parameter is the current daily data file, the /var/log/sa/sadd file. The -f option is exclusive of the -o option.

-h

Display a short help message then exit.

-i interval

Select data records at seconds as close as possible to the number specified by the interval parameter.

-I { int [,…] | SUM | ALL | XALL }

Report statistics for a given interrupt. int is the interrupt number. Specifying multiple -I int parameters on the command line will look at multiple independent interrupts. The SUM keyword indicates that the total number of interrupts received per second is to be displayed. The ALL keyword indicates that statistics from the first 16 interrupts are to be reported, whereas the XALL keyword indicates that statistics from all interrupts, including potential APIC interrupt sources, are to be reported. Note that interrupt statistics depend on sadc option “-S INT” to be collected.

-m

Report power management statistics. Note that these statistics depend on sadc option “-S POWER” to be collected. The following value is displayed:

MHz

CPU clock frequency in MHz.

-n { keyword [,…] | ALL }

Report network statistics.

Possible keywords are DEV, EDEV, NFS, NFSD, SOCK, IP, EIP, ICMP, EICMP, TCP, ETCP, UDP, SOCK6, IP6, EIP6, ICMP6, EICMP6 and UDP6.

With the DEV keyword, statistics from the network devices are reported. The following values are displayed:

Network monitoring; simple usage: sar -n DEV 1 1

IFACE

Name of the network interface for which statistics are reported.

rxpck/s

Total number of packets received per second.

txpck/s

Total number of packets transmitted per second.

rxkB/s

Total number of kilobytes received per second.

txkB/s

Total number of kilobytes transmitted per second.

rxcmp/s

Number of compressed packets received per second (for cslip etc.).

txcmp/s

Number of compressed packets transmitted per second.

rxmcst/s

Number of multicast packets received per second.

With the EDEV keyword, statistics on failures (errors) from the network devices are reported. The following values are displayed:

IFACE

Name of the network interface for which statistics are reported.

rxerr/s

Total number of bad packets received per second.

txerr/s

Total number of errors that happened per second while transmitting packets.

coll/s

Number of collisions that happened per second while transmitting packets.

rxdrop/s

Number of received packets dropped per second because of a lack of space in linux buffers.

txdrop/s

Number of transmitted packets dropped per second because of a lack of space in linux buffers.

txcarr/s

Number of carrier-errors that happened per second while transmitting packets.

rxfram/s

Number of frame alignment errors that happened per second on received packets.

rxfifo/s

Number of FIFO overrun errors that happened per second on received packets.

txfifo/s

Number of FIFO overrun errors that happened per second on transmitted packets.

With the NFS keyword, statistics about NFS client activity are reported. The following values are displayed:

call/s

Number of RPC requests made per second.

retrans/s

Number of RPC requests per second, those which needed to be retransmitted (for example because of a server timeout).

read/s

Number of ‘read’ RPC calls made per second.

write/s

Number of ‘write’ RPC calls made per second.

access/s

Number of ‘access’ RPC calls made per second.

getatt/s

Number of ‘getattr’ RPC calls made per second.

With the NFSD keyword, statistics about NFS server activity are reported. The following values are displayed:

scall/s

Number of RPC requests received per second.

badcall/s

Number of bad RPC requests received per second, those whose processing generated an error.

packet/s

Number of network packets received per second.

udp/s

Number of UDP packets received per second.

tcp/s

Number of TCP packets received per second.

hit/s

Number of reply cache hits per second.

miss/s

Number of reply cache misses per second.

sread/s

Number of ‘read’ RPC calls received per second.

swrite/s

Number of ‘write’ RPC calls received per second.

saccess/s

Number of ‘access’ RPC calls received per second.

sgetatt/s

Number of ‘getattr’ RPC calls received per second.

With the SOCK keyword, statistics on sockets in use are reported (IPv4). The following values are displayed:

totsck

Total number of sockets used by the system.

tcpsck

Number of TCP sockets currently in use.

udpsck

Number of UDP sockets currently in use.

rawsck

Number of RAW sockets currently in use.

ip-frag

Number of IP fragments currently in use.

tcp-tw

Number of TCP sockets in TIME_WAIT state.

With the IP keyword, statistics about IPv4 network traffic are reported. Note that IPv4 statistics depend on sadc option “-S SNMP” to be collected. The following values are displayed (formal SNMP names between square brackets):

irec/s

The total number of input datagrams received from interfaces per second, including those received in error [ipInReceives].

fwddgm/s

The number of input datagrams per second, for which this entity was not their final IP destination, as a result of which an attempt was made to find a route to forward them to that final destination [ipForwDatagrams].

idel/s

The total number of input datagrams successfully delivered per second to IP user-protocols (including ICMP) [ipInDelivers].

orq/s

The total number of IP datagrams which local IP user-protocols (including ICMP) supplied per second to IP in requests for transmission [ipOutRequests]. Note that this counter does not include any datagrams counted in fwddgm/s.

asmrq/s

The number of IP fragments received per second which needed to be reassembled at this entity [ipReasmReqds].

asmok/s

The number of IP datagrams successfully re-assembled per second [ipReasmOKs].

fragok/s

The number of IP datagrams that have been successfully fragmented at this entity per second [ipFragOKs].

fragcrt/s

The number of IP datagram fragments that have been generated per second as a result of fragmentation at this entity [ipFragCreates].

With the EIP keyword, statistics about IPv4 network errors are reported. Note that IPv4 statistics depend on sadc option “-S SNMP” to be collected. The following values are displayed (formal SNMP names between square brackets):

ihdrerr/s

The number of input datagrams discarded per second due to errors in their IP headers, including bad checksums, version number mismatch, other format errors, time-to-live exceeded, errors discovered in processing their IP options, etc. [ipInHdrErrors]

iadrerr/s

The number of input datagrams discarded per second because the IP address in their IP header’s destination field was not a valid address to be received at this entity. This count includes invalid addresses (e.g., 0.0.0.0) and addresses of unsupported Classes (e.g., Class E). For entities which are not IP routers and therefore do not forward datagrams, this counter includes datagrams discarded because the destination address was not a local address [ipInAddrErrors].

iukwnpr/s

The number of locally-addressed datagrams received successfully but discarded per second because of an unknown or unsupported protocol [ipInUnknownProtos].

idisc/s

The number of input IP datagrams per second for which no problems were encountered to prevent their continued processing, but which were discarded (e.g., for lack of buffer space) [ipInDiscards]. Note that this counter does not include any datagrams discarded while awaiting re-assembly.

odisc/s

The number of output IP datagrams per second for which no problem was encountered to prevent their transmission to their destination, but which were discarded (e.g., for lack of buffer space) [ipOutDiscards]. Note that this counter would include datagrams counted in fwddgm/s if any such packets met this (discretionary) discard criterion.

onort/s

The number of IP datagrams discarded per second because no route could be found to transmit them to their destination [ipOutNoRoutes]. Note that this counter includes any packets counted in fwddgm/s which meet this ‘no-route’ criterion. Note that this includes any datagrams which a host cannot route because all of its default routers are down.

asmf/s

The number of failures detected per second by the IP re-assembly algorithm (for whatever reason: timed out, errors, etc) [ipReasmFails]. Note that this is not necessarily a count of discarded IP fragments since some algorithms can lose track of the number of fragments by combining them as they are received.

fragf/s

The number of IP datagrams that have been discarded per second because they needed to be fragmented at this entity but could not be, e.g., because their Don’t Fragment flag was set [ipFragFails].

With the ICMP keyword, statistics about ICMPv4 network traffic are reported. Note that ICMPv4 statistics depend on sadc option “-S SNMP” to be collected. The following values are displayed (formal SNMP names between square brackets):

imsg/s

The total number of ICMP messages which the entity received per second [icmpInMsgs]. Note that this counter includes all those counted by ierr/s.

omsg/s

The total number of ICMP messages which this entity attempted to send per second [icmpOutMsgs]. Note that this counter includes all those counted by oerr/s.

iech/s

The number of ICMP Echo (request) messages received per second [icmpInEchos].

iechr/s

The number of ICMP Echo Reply messages received per second [icmpInEchoReps].

oech/s

The number of ICMP Echo (request) messages sent per second [icmpOutEchos].

oechr/s

The number of ICMP Echo Reply messages sent per second [icmpOutEchoReps].

itm/s

The number of ICMP Timestamp (request) messages received per second [icmpInTimestamps].

itmr/s

The number of ICMP Timestamp Reply messages received per second [icmpInTimestampReps].

otm/s

The number of ICMP Timestamp (request) messages sent per second [icmpOutTimestamps].

otmr/s

The number of ICMP Timestamp Reply messages sent per second [icmpOutTimestampReps].

iadrmk/s

The number of ICMP Address Mask Request messages received per second [icmpInAddrMasks].

iadrmkr/s

The number of ICMP Address Mask Reply messages received per second [icmpInAddrMaskReps].

oadrmk/s

The number of ICMP Address Mask Request messages sent per second [icmpOutAddrMasks].

oadrmkr/s

The number of ICMP Address Mask Reply messages sent per second [icmpOutAddrMaskReps].

With the EICMP keyword, statistics about ICMPv4 error messages are reported. Note that ICMPv4 statistics depend on sadc option “-S SNMP” to be collected. The following values are displayed (formal SNMP names between square brackets):

ierr/s

The number of ICMP messages per second which the entity received but determined as having ICMP-specific errors (bad ICMP checksums, bad length, etc.) [icmpInErrors].

oerr/s

The number of ICMP messages per second which this entity did not send due to problems discovered within ICMP such as a lack of buffers [icmpOutErrors].

idstunr/s

The number of ICMP Destination Unreachable messages received per second [icmpInDestUnreachs].

odstunr/s

The number of ICMP Destination Unreachable messages sent per second [icmpOutDestUnreachs].

itmex/s

The number of ICMP Time Exceeded messages received per second [icmpInTimeExcds].

otmex/s

The number of ICMP Time Exceeded messages sent per second [icmpOutTimeExcds].

iparmpb/s

The number of ICMP Parameter Problem messages received per second [icmpInParmProbs].

oparmpb/s

The number of ICMP Parameter Problem messages sent per second [icmpOutParmProbs].

isrcq/s

The number of ICMP Source Quench messages received per second [icmpInSrcQuenchs].

osrcq/s

The number of ICMP Source Quench messages sent per second [icmpOutSrcQuenchs].

iredir/s

The number of ICMP Redirect messages received per second [icmpInRedirects].

oredir/s

The number of ICMP Redirect messages sent per second [icmpOutRedirects].

With the TCP keyword, statistics about TCPv4 network traffic are reported. Note that TCPv4 statistics depend on sadc option “-S SNMP” to be collected. The following values are displayed (formal SNMP names between square brackets):

active/s

The number of times TCP connections have made a direct transition to the SYN-SENT state from the CLOSED state per second [tcpActiveOpens].

passive/s

The number of times TCP connections have made a direct transition to the SYN-RCVD state from the LISTEN state per second [tcpPassiveOpens].

iseg/s

The total number of segments received per second, including those received in error [tcpInSegs]. This count includes segments received on currently established connections.

oseg/s

The total number of segments sent per second, including those on current connections but excluding those containing only retransmitted octets [tcpOutSegs].

With the ETCP keyword, statistics about TCPv4 network errors are reported. Note that TCPv4 statistics depend on sadc option “-S SNMP” to be collected. The following values are displayed (formal SNMP names between square brackets):

atmptf/s

The number of times per second TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times per second TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state [tcpAttemptFails].

estres/s

The number of times per second TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state [tcpEstabResets].

retrans/s

The total number of segments retransmitted per second – that is, the number of TCP segments transmitted containing one or more previously transmitted octets [tcpRetransSegs].

isegerr/s

The total number of segments received in error (e.g., bad TCP checksums) per second [tcpInErrs].

orsts/s

The number of TCP segments sent per second containing the RST flag [tcpOutRsts].

With the UDP keyword, statistics about UDPv4 network traffic are reported. Note that UDPv4 statistics depend on sadc option “-S SNMP” to be collected. The following values are displayed (formal SNMP names between square brackets):

idgm/s

The total number of UDP datagrams delivered per second to UDP users [udpInDatagrams].

odgm/s

The total number of UDP datagrams sent per second from this entity [udpOutDatagrams].

noport/s

The total number of received UDP datagrams per second for which there was no application at the destination port [udpNoPorts].

idgmerr/s

The number of received UDP datagrams per second that could not be delivered for reasons other than the lack of an application at the destination port [udpInErrors].

With the SOCK6 keyword, statistics on sockets in use are reported (IPv6). Note that IPv6 statistics depend on sadc option "-S IPV6" to be collected. The following values are displayed:

tcp6sck

Number of TCPv6 sockets currently in use.

udp6sck

Number of UDPv6 sockets currently in use.

raw6sck

Number of RAWv6 sockets currently in use.

ip6-frag

Number of IPv6 fragments currently in use.

With the IP6 keyword, statistics about IPv6 network traffic are reported. Note that IPv6 statistics depend on sadc option "-S IPV6" to be collected. The following values are displayed (formal SNMP names between square brackets):

irec6/s

The total number of input datagrams received from interfaces per second, including those received in error [ipv6IfStatsInReceives].

fwddgm6/s

The number of output datagrams per second which this entity received and forwarded to their final destinations [ipv6IfStatsOutForwDatagrams].

idel6/s

The total number of datagrams successfully delivered per second to IPv6 user-protocols (including ICMP) [ipv6IfStatsInDelivers].

orq6/s

The total number of IPv6 datagrams which local IPv6 user-protocols (including ICMP) supplied per second to IPv6 in requests for transmission [ipv6IfStatsOutRequests]. Note that this counter does not include any datagrams counted in fwddgm6/s.

asmrq6/s

The number of IPv6 fragments received per second which needed to be reassembled at this interface [ipv6IfStatsReasmReqds].

asmok6/s

The number of IPv6 datagrams successfully reassembled per second [ipv6IfStatsReasmOKs].

imcpck6/s

The number of multicast packets received per second by the interface [ipv6IfStatsInMcastPkts].

omcpck6/s

The number of multicast packets transmitted per second by the interface [ipv6IfStatsOutMcastPkts].

fragok6/s

The number of IPv6 datagrams that have been successfully fragmented at this output interface per second [ipv6IfStatsOutFragOKs].

fragcr6/s

The number of output datagram fragments that have been generated per second as a result of fragmentation at this output interface [ipv6IfStatsOutFragCreates].

With the EIP6 keyword, statistics about IPv6 network errors are reported. Note that IPv6 statistics depend on sadc option "-S IPV6" to be collected. The following values are displayed (formal SNMP names between square brackets):

ihdrer6/s

The number of input datagrams discarded per second due to errors in their IPv6 headers, including version number mismatch, other format errors, hop count exceeded, errors discovered in processing their IPv6 options, etc. [ipv6IfStatsInHdrErrors]

iadrer6/s

The number of input datagrams discarded per second because the IPv6 address in their IPv6 header’s destination field was not a valid address to be received at this entity. This count includes invalid addresses (e.g., ::0) and unsupported addresses (e.g., addresses with unallocated prefixes). For entities which are not IPv6 routers and therefore do not forward datagrams, this counter includes datagrams discarded because the destination address was not a local address [ipv6IfStatsInAddrErrors].

iukwnp6/s

The number of locally-addressed datagrams received successfully but discarded per second because of an unknown or unsupported protocol [ipv6IfStatsInUnknownProtos].

i2big6/s

The number of input datagrams that could not be forwarded per second because their size exceeded the link MTU of outgoing interface [ipv6IfStatsInTooBigErrors].

idisc6/s

The number of input IPv6 datagrams per second for which no problems were encountered to prevent their continued processing, but which were discarded (e.g., for lack of buffer space) [ipv6IfStatsInDiscards]. Note that this counter does not include any datagrams discarded while awaiting re-assembly.

odisc6/s

The number of output IPv6 datagrams per second for which no problem was encountered to prevent their transmission to their destination, but which were discarded (e.g., for lack of buffer space) [ipv6IfStatsOutDiscards]. Note that this counter would include datagrams counted in fwddgm6/s if any such packets met this (discretionary) discard criterion.

inort6/s

The number of input datagrams discarded per second because no route could be found to transmit them to their destination [ipv6IfStatsInNoRoutes].

onort6/s

The number of locally generated IP datagrams discarded per second because no route could be found to transmit them to their destination [unknown formal SNMP name].

asmf6/s

The number of failures detected per second by the IPv6 re-assembly algorithm (for whatever reason: timed out, errors, etc.) [ipv6IfStatsReasmFails]. Note that this is not necessarily a count of discarded IPv6 fragments since some algorithms can lose track of the number of fragments by combining them as they are received.

fragf6/s

The number of IPv6 datagrams that have been discarded per second because they needed to be fragmented at this output interface but could not be [ipv6IfStatsOutFragFails].

itrpck6/s

The number of input datagrams discarded per second because datagram frame didn’t carry enough data [ipv6IfStatsInTruncatedPkts].

With the ICMP6 keyword, statistics about ICMPv6 network traffic are reported. Note that ICMPv6 statistics depend on sadc option "-S IPV6" to be collected. The following values are displayed (formal SNMP names between square brackets):

imsg6/s

The total number of ICMP messages received by the interface per second which includes all those counted by ierr6/s [ipv6IfIcmpInMsgs].

omsg6/s

The total number of ICMP messages which this interface attempted to send per second [ipv6IfIcmpOutMsgs].

iech6/s

The number of ICMP Echo (request) messages received by the interface per second [ipv6IfIcmpInEchos].

iechr6/s

The number of ICMP Echo Reply messages received by the interface per second [ipv6IfIcmpInEchoReplies].

oechr6/s

The number of ICMP Echo Reply messages sent by the interface per second [ipv6IfIcmpOutEchoReplies].

igmbq6/s

The number of ICMPv6 Group Membership Query messages received by the interface per second [ipv6IfIcmpInGroupMembQueries].

igmbr6/s

The number of ICMPv6 Group Membership Response messages received by the interface per second [ipv6IfIcmpInGroupMembResponses].

ogmbr6/s

The number of ICMPv6 Group Membership Response messages sent per second [ipv6IfIcmpOutGroupMembResponses].

igmbrd6/s

The number of ICMPv6 Group Membership Reduction messages received by the interface per second [ipv6IfIcmpInGroupMembReductions].

ogmbrd6/s

The number of ICMPv6 Group Membership Reduction messages sent per second [ipv6IfIcmpOutGroupMembReductions].

irtsol6/s

The number of ICMP Router Solicit messages received by the interface per second [ipv6IfIcmpInRouterSolicits].

ortsol6/s

The number of ICMP Router Solicitation messages sent by the interface per second [ipv6IfIcmpOutRouterSolicits].

irtad6/s

The number of ICMP Router Advertisement messages received by the interface per second [ipv6IfIcmpInRouterAdvertisements].

inbsol6/s

The number of ICMP Neighbor Solicit messages received by the interface per second [ipv6IfIcmpInNeighborSolicits].

onbsol6/s

The number of ICMP Neighbor Solicitation messages sent by the interface per second [ipv6IfIcmpOutNeighborSolicits].

inbad6/s

The number of ICMP Neighbor Advertisement messages received by the interface per second [ipv6IfIcmpInNeighborAdvertisements].

onbad6/s

The number of ICMP Neighbor Advertisement messages sent by the interface per second [ipv6IfIcmpOutNeighborAdvertisements].

With the EICMP6 keyword, statistics about ICMPv6 error messages are reported. Note that ICMPv6 statistics depend on sadc option "-S IPV6" to be collected. The following values are displayed (formal SNMP names between square brackets):

ierr6/s

The number of ICMP messages per second which the interface received but determined as having ICMP-specific errors (bad ICMP checksums, bad length, etc.) [ipv6IfIcmpInErrors]

idtunr6/s

The number of ICMP Destination Unreachable messages received by the interface per second [ipv6IfIcmpInDestUnreachs].

odtunr6/s

The number of ICMP Destination Unreachable messages sent by the interface per second [ipv6IfIcmpOutDestUnreachs].

itmex6/s

The number of ICMP Time Exceeded messages received by the interface per second [ipv6IfIcmpInTimeExcds].

otmex6/s

The number of ICMP Time Exceeded messages sent by the interface per second [ipv6IfIcmpOutTimeExcds].

iprmpb6/s

The number of ICMP Parameter Problem messages received by the interface per second [ipv6IfIcmpInParmProblems].

oprmpb6/s

The number of ICMP Parameter Problem messages sent by the interface per second [ipv6IfIcmpOutParmProblems].

iredir6/s

The number of Redirect messages received by the interface per second [ipv6IfIcmpInRedirects].

oredir6/s

The number of Redirect messages sent by the interface by second [ipv6IfIcmpOutRedirects].

ipck2b6/s

The number of ICMP Packet Too Big messages received by the interface per second [ipv6IfIcmpInPktTooBigs].

opck2b6/s

The number of ICMP Packet Too Big messages sent by the interface per second [ipv6IfIcmpOutPktTooBigs].

With the UDP6 keyword, statistics about UDPv6 network traffic are reported. Note that UDPv6 statistics depend on sadc option "-S IPV6" to be collected. The following values are displayed (formal SNMP names between square brackets):

idgm6/s

The total number of UDP datagrams delivered per second to UDP users [udpInDatagrams].

odgm6/s

The total number of UDP datagrams sent per second from this entity [udpOutDatagrams].

noport6/s

The total number of received UDP datagrams per second for which there was no application at the destination port [udpNoPorts].

idgmer6/s

The number of received UDP datagrams per second that could not be delivered for reasons other than the lack of an application at the destination port [udpInErrors].

The ALL keyword is equivalent to specifying all the keywords above and therefore all the network activities are reported.

-o [ filename ]

Save the readings in the file in binary form. Each reading is in a separate record. The default value of the filename parameter is the current daily data file, the /var/log/sa/sadd file. The -o option is exclusive of the -f option. All the data available from the kernel are saved in the file (in fact, sar calls its data collector sadc with the option "-S ALL". See sadc(8) manual page).

-P { cpu [,…] | ALL }

Report per-processor statistics for the specified processor or processors. Specifying the ALL keyword reports statistics for each individual processor, and globally for all processors. Note that processor 0 is the first processor.

-p

Pretty-print device names. Use this option in conjunction with option -d. By default names are printed as dev m-n where m and n are the major and minor numbers for the device. Use of this option displays the names of the devices as they (should) appear in /dev. Name mappings are controlled by /etc/sysconfig/sysstat.ioconf.

-q

Report queue length and load averages. The following values are displayed:

Load information.

runq-sz

Run queue length (number of tasks waiting for run time).

Length of the queue of tasks waiting to run.

plist-sz

Number of tasks in the task list.

Total number of tasks.

ldavg-1

System load average for the last minute. The load average is calculated as the average number of runnable or running tasks (R state), and the number of tasks in uninterruptible sleep (D state) over the specified interval.

One-minute load, counting runnable and running tasks as well as tasks in uninterruptible sleep.

ldavg-5

System load average for the past 5 minutes.

ldavg-15

System load average for the past 15 minutes.

-r

Report memory utilization statistics. The following values are displayed:

Memory usage monitoring.

kbmemfree

Amount of free memory available in kilobytes.

kbmemused

Amount of used memory in kilobytes. This does not take into account memory used by the kernel itself.

Memory in use, not counting memory used by the kernel itself.

%memused

Percentage of used memory.

kbbuffers

Amount of memory used as buffers by the kernel in kilobytes.

Kernel buffer size, in KB. On the difference between buffers and cache, see Understanding the free command in Linux/Unix and Overview of memory management: if any application later wants this buffer/cache memory, Linux will free it for the application.

kbcached

Amount of memory used to cache data by the kernel in kilobytes.

Kernel cache size, in KB. Again: if any application later wants this buffer/cache memory, Linux will free it for the application.

kbcommit

Amount of memory in kilobytes needed for current workload. This is an estimate of how much RAM/swap is needed to guarantee that there never is out of memory.

Memory needed to keep the current workload running; a predictive estimate, in KB.

%commit

Percentage of memory needed for current workload in relation to the total amount of memory (RAM+swap). This number may be greater than 100% because the kernel usually overcommits memory.

kbactive

Amount of active memory in kilobytes (memory that has been used more recently and usually not reclaimed unless absolutely necessary).

Active memory usage; not reclaimed unless absolutely necessary.

kbinact

Amount of inactive memory in kilobytes (memory which has been less recently used. It is more eligible to be reclaimed for other  purposes).

Inactive memory usage.

-R

Report memory statistics. The following values are displayed:

More memory-related monitoring.

frmpg/s

Number of memory pages freed by the system per second. A negative value represents a number of pages allocated by the system. Note that a page has a size of 4 kB or 8 kB according to the machine architecture.

Pages freed by the system per second; a negative value means the system allocated more pages than it freed. A page is 4 KB or 8 KB depending on the architecture.

bufpg/s

Number of additional memory pages used as buffers by the system per second. A negative value means fewer pages used as buffers by the system.

Additional pages per second used as buffers.

campg/s

Number of additional memory pages cached by the system per second. A negative value means fewer pages in the cache.

Additional pages per second used as cache.

-s [ hh:mm:ss ]

Set the starting time of the data, causing the sar command to extract records time-tagged at, or following, the time specified. The default starting time is 08:00:00. Hours must be given in 24-hour format. This option can be used only when data are read from a file (option -f ).

-S

Report swap space utilization statistics. The following values are displayed:

kbswpfree

Amount of free swap space in kilobytes.

kbswpused

Amount of used swap space in kilobytes.

%swpused

Percentage of used swap space.

kbswpcad

Amount of cached swap memory in kilobytes. This is memory that once was swapped out, is swapped back in but still also is in the swap area (if memory is needed it doesn’t need to be swapped out again because it is already in the swap area. This saves I/O).

%swpcad

Percentage of cached swap memory in relation to the amount of used swap space.

-t

When reading data from a daily data file, indicate that sar should display the timestamps in the original locale time of the data file creator. Without this option, the sar command displays the timestamps in the user’s locale time.

-u [ ALL ]

Report CPU utilization. The ALL keyword indicates that all the CPU fields should be displayed. The report may show the following fields:

CPU utilization monitoring. This is also the default report, i.e. sar -u is equivalent to plain sar. A /proc/stat sketch follows the field descriptions below.

%user

Percentage of CPU utilization that occurred while executing at the user level (application). Note that this field includes time spent running virtual processors.

%usr

Percentage of CPU utilization that occurred while executing at the user level (application). Note that this field does NOT include time spent running virtual processors.

%nice

Percentage of CPU utilization that occurred while executing at the user level with nice priority.

%system

Percentage of CPU utilization that occurred while executing at the system level (kernel). Note that this field includes time spent servicing hardware and software interrupts.

%sys

Percentage of CPU utilization that occurred while executing at the system level (kernel). Note that this field does NOT include time spent servicing hardware or software interrupts.

%iowait

Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.

The CPU was idle, but the system still had outstanding disk I/O requests.

%steal

Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.

%irq

Percentage of time spent by the CPU or CPUs to service hardware interrupts.

Share of time spent servicing hardware interrupts.

%soft

Percentage of time spent by the CPU or CPUs to service software interrupts.

Share of time spent servicing software interrupts.

%guest

Percentage of time spent by the CPU or CPUs to run a virtual processor.

%idle

Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.

The CPU was idle and there were no outstanding disk I/O requests; note the contrast with %iowait.

Note: On SMP machines a processor that does not have any activity at all (0.00 for every field) is a disabled (offline) processor.
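The sketch referenced above: a hedged illustration of how such percentages can be derived by differencing two samples of the aggregate cpu line in /proc/stat. The field order assumed here (user nice system idle iowait irq softirq steal) matches 2.6-era kernels; this is my own reading of /proc, not sar's actual code:

#include <stdio.h>
#include <unistd.h>

static void read_cpu(unsigned long long v[8]) {
    FILE *f = fopen("/proc/stat", "r");
    /* first line: cpu user nice system idle iowait irq softirq steal ... */
    fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
           &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
    fclose(f);
}

int main(void) {
    unsigned long long a[8] = {0}, b[8] = {0}, total = 0;
    int i;
    read_cpu(a);
    sleep(1);
    read_cpu(b);
    for (i = 0; i < 8; i++) total += b[i] - a[i];
    printf("%%user=%.1f %%system=%.1f %%iowait=%.1f %%idle=%.1f\n",
           100.0 * (b[0] - a[0]) / total, 100.0 * (b[2] - a[2]) / total,
           100.0 * (b[4] - a[4]) / total, 100.0 * (b[3] - a[3]) / total);
    return 0;
}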

-v

Report status of inode, file and other kernel tables. The following values are displayed:

Inode, file and other kernel-table data; a /proc/sys/fs sketch follows the field list below.

dentunusd

Number of unused cache entries in the directory cache.

file-nr

Number of file handles used by the system.

inode-nr

Number of inode handlers used by the system.

pty-nr

Number of pseudo-terminals used by the system.
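The sketch mentioned above: these tables can also be read directly. The /proc/sys/fs paths below are my assumption about the usual sources of these counters, not something quoted from sar's source:

#include <stdio.h>

static void show(const char *path) {
    char line[128];
    FILE *f = fopen(path, "r");
    if (f && fgets(line, sizeof line, f))
        printf("%s: %s", path, line);
    if (f) fclose(f);
}

int main(void) {
    show("/proc/sys/fs/file-nr");      /* allocated, unused, max file handles */
    show("/proc/sys/fs/inode-nr");     /* allocated, free inodes */
    show("/proc/sys/fs/dentry-state"); /* dentries, unused (dentunusd), ... */
    return 0;
}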

-V

Print version number then exit.

-w

Report task creation and system switching activity.

Task creation and scheduling data.

proc/s

Total number of tasks created per second.

cswch/s

Total number of context switches per second.

-W

Report swapping statistics. The following values are displayed:

pswpin/s

Total number of swap pages the system brought in per second.

pswpout/s

Total number of swap pages the system brought out per second.

-y

Report TTY device activity. The following values are displayed:

rcvin/s

Number of receive interrupts per second for current serial line. Serial line number is given in the TTY column.

xmtin/s

Number of transmit interrupts per second for current serial line.

framerr/s

Number of frame errors per second for current serial line.

prtyerr/s

Number of parity errors per second for current serial line.

brk/s

Number of breaks per second for current serial line.

ovrun/s

Number of overrun errors per second for current serial line.

Note that with recent 2.6 kernels, these statistics can be retrieved only by root.

Environment

The sar command takes into account the following environment variables:

S_TIME_FORMAT

If this variable exists and its value is ISO then the current locale will be ignored when printing the date in the report header. The sar command will use the ISO 8601 format (YYYY-MM-DD) instead.

S_TIME_DEF_TIME

If this variable exists and its value is UTC then sar will save its data in UTC time (data will still be displayed in local time). sar will also use UTC time instead of local time to determine the current daily data file located in the /var/log/sa directory. This variable may be useful for servers with users located across several timezones.

Examples

sar -u 2 5

Report CPU utilization for each 2 seconds. 5 lines are displayed.

sar -I 14 -o int14.file 2 10

Report statistics on IRQ 14 for each 2 seconds. 10 lines are displayed. Data are stored in a file called int14.file.

sar -r -n DEV -f /var/log/sa/sa16

Display memory and network statistics saved in daily data file 'sa16'.

sar -A

Display all the statistics saved in current daily data file.

Bugs

/proc filesystem must be mounted for the sar command to work.

All the statistics are not necessarily available, depending on the kernel version used.

Files

/var/log/sa/sadd

Indicate the daily data file, where the dd parameter is a number representing the day of the month.

/proc contains various files with system statistics.

Author

Sebastien Godard (sysstat <at> orange.fr)

See Also

sadc(8), sa1(8), sa2(8), sadf(1), isag(1), pidstat(1), mpstat(1), iostat(1), vmstat(8)

Storage-type server

Let's start with a storage server as an example and analyze its memory usage.

The free output is shown below. The first number to look at is total = 65953012 kB ≈ 62 GB, although the machine physically has 64 GB of RAM: total does not include memory used by the kernel itself, nor any memory reserved by hardware.

From the second line we can see that, if both buffers and cached are treated as reusable, about 54 GB is still available, which is plenty. Excluding buffers and cached, however, only 268 MB is completely free, which is very little. The cache in particular holds frequently used disk data; flushing it often hurts performance, since reads that used to hit memory now hit disk. So how do we decide whether memory is really sufficient? Read on.

sar -r 1 5 shows overall memory usage as below; the first few columns agree with free. Look at the kbactive and kbinact fields: if memory reclaim happens, there are still 30+ GB of rarely used inactive memory to draw from, so the actively used memory will generally be left alone.

sar -B 1 shows paging information; the excerpt below covers a short period. pgpgin/s stayed at 1024 for a while, whereas pgpgout/s spiked intermittently. fault/s is fairly large, but majflt/s is 0: page faults did occur, but they were mostly about mapping virtual memory to physical memory, with essentially no disk-to-memory (major) faults, so performance is acceptable. This matches the workload of this server: it runs hbase, hdfs, zookeeper, thrift and ETL C++ processes, with many writes and very few reads, so outside of splits and merges there is almost no data read in from disk.

sar -R 1 10: during this period more pages were allocated than freed; a few pages were added for buffers, and rather more for cache.

Summing vsz and rss across all processes gives 71 GB of vsz but only 9 GB of rss. vsz already exceeds the 62 GB of total system memory, which is only possible thanks to the virtual memory system making the total appear larger. The rss total is below the 11 GB "used minus buffers/cache" figure on free's second line, and clearly does not include buffer and cache memory either.

$ ps axu | perl -ne 'BEGIN{$vsz=0; $rss=0;} split; $vsz+=$_[4]; $rss+=$_[5];END{print $vsz ."\t". $rss ."\n";}'

71522748 9625152

The top 10 processes by vsz are listed below; for most of them rss is far smaller than vsz, because Linux delays actually backing malloc'd memory for user processes, and code and data segments are likewise only loaded from disk via page faults when first used; until then they are just logical addresses in the virtual address space (the small C demo after the table below makes this visible).

  1. thrift 4,901,424 486,352
  2. hbase region server 4,789,688 4,485,900
  3. hbase master 4,672,580 232,412
  4. zookeeper 4,588,768 85,508
  5. ETL C++ 1,885,824 787,656
  6. ETL C++ 1,862,092 774,448
  7. ETL C++ 1,682,696 414,276
  8. hadoop datanode 1,416,264 176,760
  9. hadoop datanode 1,376,536 177,436
  10. monitoring platform process 996,048 14,372
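The demo promised above, a hedged sketch (the exact numbers vary by system): malloc enlarges VmSize immediately, but VmRSS only rises once the pages are actually touched, because allocation is lazy and backed by page faults:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void print_vm(const char *tag) {
    char line[128];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return;
    while (fgets(line, sizeof line, f))
        if (!strncmp(line, "VmSize:", 7) || !strncmp(line, "VmRSS:", 6))
            printf("[%s] %s", tag, line);
    fclose(f);
}

int main(void) {
    print_vm("start");
    char *p = malloc((size_t)256 << 20); /* reserve 256 MB of address space */
    if (!p) return 1;
    print_vm("after malloc");            /* VmSize jumps, VmRSS barely moves */
    memset(p, 1, (size_t)256 << 20);     /* fault every page in */
    print_vm("after touching pages");    /* now VmRSS jumps too */
    free(p);
    return 0;
}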

Compute-type Nginx server

Now take an Nginx server and analyze its memory. Its buffer usage is small, only 290 MB; buffers mainly cache block-device reads and writes, and an nginx service handling http requests moves far less data than HBase.

sar -r 1 5:

sar -B 1:

sar -R 1 10: on average, more pages were allocated than freed over this period; no pages were added for buffers and only a few for cache.

Summing vsz and rss across all processes gives just 18 GB of vsz and 3 GB of rss; this server clearly has memory to spare.

$ ps axu | perl -ne 'BEGIN{$vsz=0; $rss=0;} split; $vsz+=$_[4]; $rss+=$_[5];END{print $vsz ."\t". $rss ."\n";}'
18630732 3119256

References

http://www.linuxatemyram.com/play.html

http://www.linuxhowtos.org/System/Linux%20Memory%20Management.htm

http://www.linuxnix.com/2013/05/find-ram-size-in-linuxunix.html

http://en.wikipedia.org/wiki/Paging

http://www.win.tue.nl/~aeb/linux/lk/lk-9.html

http://blog.csdn.net/dlutbrucezhang/article/details/9058583

http://oss.org.cn/kernel-book/

The ZEND_MM_SMALL_FREE_BUCKET macro: laruence (鸟哥) and others have described it, but (perhaps I am slow) I never quite followed their explanations, so I picked it apart bit by bit myself and record the result here:

struct _zend_mm_heap {
    zend_mm_free_block *free_buckets[ZEND_MM_NUM_BUCKETS*2]; // an array of pointers
};

#define ZEND_MM_SMALL_FREE_BUCKET(heap, index) \
    (zend_mm_free_block*) ((char*)&heap->free_buckets[index * 2] + \
        sizeof(zend_mm_free_block*) * 2 - \
        sizeof(zend_mm_small_free_block))

As the figure above shows, once struct _zend_mm_heap *heap has been allocated, heap->free_buckets points to an array of ZEND_MM_NUM_BUCKETS * 2 pointer elements.

Suppose the macro is called with index=0. After (char*)&heap->free_buckets[index * 2], the pointer points at the first byte of free_buckets, and note that it has already been cast to (char*)!

Then, after adding sizeof(zend_mm_free_block*) * 2, the pointer has moved forward by two pointers, i.e. 8*2=16 bytes on a 64-bit machine. At this point, imagine a fictitious zend_mm_small_free_block that ends at the current pointer. Note the difference between zend_mm_small_free_block and zend_mm_free_block in the figure below: whatever the debug macros are set to, their leading fields are identical.

Finally, after subtracting sizeof(zend_mm_small_free_block) and casting to (zend_mm_free_block*), the return value actually points at illegal memory at a negative offset before the array. But so what? As long as we never touch the illegal region, everything is fine! The only fields ever accessed through the returned pointer are p->prev_free_block and next_free_block, and those two really do map onto valid slots!
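To convince myself, here is a self-contained C sketch. The structs are simplified stand-ins for the Zend ones (my assumption: only the layout matters, with prev/next as the last two pointer-sized fields); it verifies that the fake block's prev/next fields alias exactly the right free_buckets slots:

#include <stdio.h>
#include <stddef.h>

/* simplified stand-ins for zend_mm_free_block / zend_mm_small_free_block */
typedef struct _free_block {
    size_t info; /* header fields we never touch through the fake block */
    struct _free_block *prev_free_block;
    struct _free_block *next_free_block;
} free_block;

typedef struct _small_free_block {
    size_t info;
    struct _free_block *prev_free_block;
    struct _free_block *next_free_block;
} small_free_block;

#define NUM_BUCKETS 4
static free_block *free_buckets[NUM_BUCKETS * 2];

/* same pointer arithmetic as ZEND_MM_SMALL_FREE_BUCKET (it deliberately
 * forms an out-of-range pointer, just like Zend does) */
#define SMALL_FREE_BUCKET(index) \
    ((free_block*)((char*)&free_buckets[(index) * 2] + \
        sizeof(free_block*) * 2 - sizeof(small_free_block)))

int main(void) {
    free_block *fake = SMALL_FREE_BUCKET(1);
    printf("&fake->prev_free_block == &free_buckets[2]? %d\n",
           (void*)&fake->prev_free_block == (void*)&free_buckets[2]);
    printf("&fake->next_free_block == &free_buckets[3]? %d\n",
           (void*)&fake->next_free_block == (void*)&free_buckets[3]);
    return 0;
}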

Introduction

Most modern programming languages have their own memory management mechanism, unlike C and C++ where the programmer must malloc and free memory manually. By memory management I mean that the virtual machine, compiler or similar machinery (my compiler theory is rusty, so I forget exactly which phase this belongs to) automatically decides when memory needs to be allocated, how to allocate it, and when to release it. In PHP, this work is done by the Zend engine.

The reasons are twofold. One is that it simplifies the programmer's work; the other is that for a long-running process like PHP, a memory leak would be disastrous. Moreover, malloc can in turn issue system calls that switch between user mode and kernel mode; pre-allocating memory reduces these switches and improves performance.

When writing PHP extensions, the recommended allocator is the pemalloc function (actually a macro). It can request persistent memory or per-request memory: the former is shared across requests, while the latter is released at the end of each request. Ordinary application-level PHP extensions mostly use per-request memory.

pemalloc's per-request path actually calls the _emalloc function, and this path leads straight into PHP's memory management.

Below interface functions like pemalloc sit the heap layer and the storage layer. The heap can be seen as pre-allocated memory PHP maintains, organized into heaps, buckets, trees and other data structures for the interface layer to use. The storage layer is what actually requests and releases memory from the operating system.

Looking closer at the heap layer, it keeps an array of small-memory pointers and an array of large-memory pointers, each storing bucket lists of various sizes (think of it as two-dimensional). When memory is needed, the real size is first rounded to a suitable bucket size (possibly slightly larger than requested); if that bucket list has a free block, it is used directly; otherwise a large chunk is requested from the storage layer, initialized into buckets, and one of them is handed out. How to locate the right bucket list efficiently, how to find a free bucket quickly, and how freed buckets are recycled are all interesting questions. Judging from the bitmap field in the heap struct, a bitmap is probably involved; details to be investigated.

Of course, the heap cannot grow without bound: every php process has a memory_limit, and if it uses more memory than that, the php process dies.

Questions

The introduction above is only generalities that probably apply to any large piece of software. What does PHP specifically do?

  • How does the zend memory manager manage memory? Concretely, how does it allocate, reclaim and defragment memory? Does unset trigger an immediate release?
  • How do we monitor PHP's memory usage, including VM and RSS, or even heap, stack and so on?
  • At the implementation level, what do the heap and storage layer data structures look like? How are the underlying malloc and mmap_xxx methods used?

Basics and monitoring methods

For memory fundamentals, I strongly recommend reading http://www.slideshare.net/jpauli/understanding-php-memory first.

Note that when php is started in cli mode, grep shows two processes:

The first is the bash process; only the second is the php process. All analysis below refers to the php process.

$ cat /proc/55082/status

Name: php

State: S (sleeping)

Tgid: 55082 // thread group id; a thread always belongs to some process

Pid: 55082  // current process id

PPid: 40321  // parent process id

TracerPid: 0 // pid of the process tracing this one; 0 means no tracer. A tracer can be attached with e.g. strace -p

Uid: 11797 11797 11797 11797

Gid: 11797 11797 11797 11797

FDSize: 256 // number of file descriptor slots currently allocated, rounded to multiples of 32/64 (the fd table is not shrunk when fds are closed, so this value does not decrease)

Groups: 11797

VmPeak:   52836 kB // peak virtual memory size

VmSize:   52836 kB // size of VM map

VmLck:       0 kB // physical memory the process has locked; locked memory cannot be swapped to disk

VmHWM:     1032 kB  // peak physical memory usage of the process

VmRSS:     1032 kB // resident set size, actual physical memory in use

VmAnon:     168 kB

VmFile:     864 kB

VmData:     188 kB // data storage, private to the process. Note it covers both the data segment and the heap, i.e. memory malloc'd dynamically also counts here

VmStk:       88 kB // size of the stack segment, private to the process

VmExe:     688 kB // size of the text segment, shared by all processes on the machine running the same program

VmLib:     1296 kB // size of the libraries used by the process

VmPTE:       56 kB

VmSwap:       0 kB

Threads: 1

SigQ: 1/515319

SigPnd: 0000000000000000

The proc status file gives an overview of memory consumption, e.g. the total and per-segment usage. But what exactly is eating the memory? For that, use pmap -x <pid>, or look at the /proc/<pid>/maps and /proc/<pid>/smaps files.

$ man 5 proc // documents the files under /proc

Note the perms field, which can reveal the copy-on-write property.

With the methods above we can see memory used by the .so libraries PHP depends on, the PHP core engine and user code, dynamic data, the stack, and so on. But what if we want to see how much memory is occupied when execution reaches a particular line of code?

For that we can rely on PHP's memory_get_usage() method. Note its only parameter, which defaults to false:

real_usage

Set this to TRUE to get the real size of memory allocated from system. If not set or FALSE only the memory used by emalloc() is reported.

If it is false, only memory allocated through emalloc is reported; persistent memory from pemalloc is not included.

Take the simplest possible php script as an example:

$arr = array();
for ($i = 0; $i < 100000; $i++) {
    $arr[] = str_pad($i, 'a', 10000); // NB: str_pad(input, length, pad); with the arguments in this order the length casts to 0, so $i comes back unpadded
}
var_dump(memory_get_usage(true));
var_dump(memory_get_usage(false));
echo "Sleep...\n";
sleep(1000);

  1. ps shows vsz of 109M and rss of 23M; focus on rss first, the actual physical memory the whole process consumes on its own.
  2. cat /proc/<pid>/status: VmRSS = 23236 kB (matching ps), VmData = 19836 kB; VmData here covers initialized global variables and the heap.
  3. memory_get_usage(true) = 17563648, the real size of memory allocated from the system
  4. memory_get_usage(false) = 17148992, only the memory used by emalloc()
So why is even memory_get_usage(true) smaller than VmData? First, it measures memory allocated through PHP, whether via the emalloc or pemalloc families, and all variables in user php code (including "globals") are covered. But the PHP core and Zend engine certainly have initialized globals of their own; those belong to the data segment, counted in VmData but not by memory_get_usage. Likewise, lower-level shared libraries that PHP depends on obtain any memory they need via plain malloc; that is heap memory too, but again invisible to memory_get_usage. And non-standard PHP extensions may bypass emalloc/pemalloc and manage memory with malloc themselves, which would also go unmeasured. So the memory reported by memory_get_usage can be regarded as a subset of VmData.

Memory management

For basics on zval, refcount and is_ref, see the references, especially the section on variable assignment and reference counting.

PHP's memory management is implemented by the Zend Memory Manager. To take full charge of memory, Zend MM has one central structure, zend_mm_heap, which holds the storage layer (a set of function pointers), a bitmap of available buckets, the free-bucket lists, and more. The important data structures are listed below; note that cache, free_buckets, large_free_buckets and rest_buckets in zend_mm_heap are all arrays of pointers.

Heap initialization

One important property of the heap is its storage implementation. According to Zend/README.ZEND_MM:

The Zend MM can be tweaked using ZEND_MM_MEM_TYPE and ZEND_MM_SEG_SIZE environment variables. Default values are "malloc" and "256K". Dependent on target system you can also use "mmap_anon", "mmap_zero" and "win32" storage managers.

$ ZEND_MM_MEM_TYPE=mmap_anon ZEND_MM_SEG_SIZE=1M sapi/cli/php ..etc.

In other words, on Linux the default storage handler is "malloc". Note that this does not refer to the malloc() function; it is the zend_mm_mem_handlers.name field, whose possible values are "win32", "malloc", "mmap_anon" and "mmap_zero".

Heap initialization proceeds as shown in the flow chart below:

In zend_mm_init, the setup of free_buckets is unusual: the ZEND_MM_SMALL_FREE_BUCKET macro plays tricks with memory offsets, as described in the ZEND_MM_SMALL_FREE_BUCKET write-up above. After zend_mm_init, heap->free_buckets looks like the figure below: each bucket stores only the prev and next pointers directly, rather than storing a zend_mm_free_block* that points to newly allocated memory holding prev, next and the other fields.

heap->rest_buckets[0] and rest_buckets[1] both point to a similar dummy region at this stage.

Using heap memory

TODO: _zend_mm_alloc_int 

Some frequently used macros, annotated:

#define ZEND_MM_ALIGNED_HEADER_SIZE ZEND_MM_ALIGNED_SIZE(sizeof(zend_mm_block))
// aligned size of zend_mm_block, the minimal block header: current block size, previous block size, etc.

#define ZEND_MM_ALIGNED_FREE_HEADER_SIZE ZEND_MM_ALIGNED_SIZE(sizeof(zend_mm_small_free_block))
// aligned size of zend_mm_small_free_block, i.e. the header of a small free block

#define ZEND_MM_MIN_ALLOC_BLOCK_SIZE ZEND_MM_ALIGNED_SIZE(ZEND_MM_ALIGNED_HEADER_SIZE + END_MAGIC_SIZE)
// aligned size of zend_mm_block plus the end-of-block magic; the first macro plus room for the magic

#define ZEND_MM_ALIGNED_MIN_HEADER_SIZE (ZEND_MM_MIN_ALLOC_BLOCK_SIZE>ZEND_MM_ALIGNED_FREE_HEADER_SIZE?ZEND_MM_MIN_ALLOC_BLOCK_SIZE:ZEND_MM_ALIGNED_FREE_HEADER_SIZE)
// aligned minimal header: the larger of the second and third macros

#define ZEND_MM_ALIGNED_SEGMENT_SIZE ZEND_MM_ALIGNED_SIZE(sizeof(zend_mm_segment))
// aligned size of zend_mm_segment

#define ZEND_MM_MIN_SIZE ((ZEND_MM_ALIGNED_MIN_HEADER_SIZE>(ZEND_MM_ALIGNED_HEADER_SIZE+END_MAGIC_SIZE))?(ZEND_MM_ALIGNED_MIN_HEADER_SIZE-(ZEND_MM_ALIGNED_HEADER_SIZE+END_MAGIC_SIZE)):0)
// the minimum request size: if MIN_HEADER_SIZE exceeds header+magic, the difference; otherwise 0

#define ZEND_MM_MAX_SMALL_SIZE ((ZEND_MM_NUM_BUCKETS<<ZEND_MM_ALIGNMENT_LOG2)+ZEND_MM_ALIGNED_MIN_HEADER_SIZE)
// upper bound for small blocks; anything larger must use large_free_buckets

#define ZEND_MM_TRUE_SIZE(size) ((size<ZEND_MM_MIN_SIZE)?(ZEND_MM_ALIGNED_MIN_HEADER_SIZE):(ZEND_MM_ALIGNED_SIZE(size+ZEND_MM_ALIGNED_HEADER_SIZE+END_MAGIC_SIZE)))
// if the request is below MIN_SIZE, just allocate MIN_HEADER_SIZE; otherwise the requested size plus header and magic

#define ZEND_MM_BUCKET_INDEX(true_size) ((true_size>>ZEND_MM_ALIGNMENT_LOG2)-(ZEND_MM_ALIGNED_MIN_HEADER_SIZE>>ZEND_MM_ALIGNMENT_LOG2))
// index of the free bucket serving a given true_size

#define ZEND_MM_SMALL_SIZE(true_size) (true_size < ZEND_MM_MAX_SMALL_SIZE)
// whether a small block should be used

One detail worth noting: all heap allocations are 8-byte aligned, implemented by the ZEND_MM_ALIGNED_SIZE macro.
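A hedged sketch of that alignment arithmetic, assuming the usual ZEND_MM_ALIGNMENT of 8: one add and one mask round any size up to the next multiple of 8, which is essentially what ZEND_MM_ALIGNED_SIZE does:

#include <stdio.h>

#define ALIGNMENT 8 /* stand-in for ZEND_MM_ALIGNMENT */
#define ALIGNED_SIZE(size) (((size) + ALIGNMENT - 1) & ~(size_t)(ALIGNMENT - 1))

int main(void) {
    size_t s;
    for (s = 1; s <= 17; s += 4)
        printf("size %2zu -> aligned %2zu\n", s, ALIGNED_SIZE(s));
    return 0;
}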

Reclaiming heap memory

unset and memory

With USE_ZEND_ALLOC enabled (the default), RSS drops visibly after unset, though it stays above the level before the allocation; with USE_ZEND_ALLOC off, RSS drops only slightly after unset.

TODO

References

http://www.slideshare.net/jpauli/understanding-php-memory

http://www.laruence.com/2011/11/09/2277.html

https://wiki.php.net/internals/zend_mm

http://www.webreference.com/programming/php_mem/

http://www.ibm.com/developerworks/cn/opensource/os-php-v521/

http://blog.csdn.net/zjl410091917/article/details/8075691  meaning of the proc status fields

http://www.cnblogs.com/jiayy/p/3458076.html  meaning of the files under proc

http://c.biancheng.net/cpp/html/476.html  arrays of pointers vs. pointers to arrays

Software like PHP and Nginx that must frequently allocate and free memory generally provides its own user-space memory management, because calling malloc/free directly can cause costly switches between user mode and kernel mode. So what exactly is the difference between the two?

Definitions of kernel mode and user mode

The CPU only ever runs in one of the following two states:

  1. Kernel Mode: In Kernel mode, the executing code has complete and unrestricted access to the underlying hardware. It can execute any CPU instruction and reference any memory address. Kernel mode is generally reserved for the lowest-level, most trusted functions of the operating system. Crashes in kernel mode are catastrophic; they will halt the entire PC.
  2. User Mode: In User mode, the executing code has no ability to directly access hardware or reference memory. Code running in user mode must delegate to system APIs to access hardware or memory. Due to the protection afforded by this sort of isolation, crashes in user mode are always recoverable. Most of the code running on your computer will execute in user mode.
Different hardware implements this differently; x86 uses four hardware protection rings, 0-3. Reportedly Linux uses only ring 0 for kernel mode and ring 3 for user mode, leaving rings 1-2 unused, while some Windows drivers do use rings 1-2.

When switches happen

Typically, there are 2 points of switching:

  1. When calling a System Call: after calling a System Call, the task voluntarily calls pieces of code living in Kernel Mode
  2. When an IRQ (or exception) comes: after the IRQ an IRQ handler (or exception handler) is called, then control returns back to the task that was interrupted as if nothing had happened.

IRQ stands for Interrupt Request; an IRQ's job is to carry out hardware interrupt requests in our machines.

Generally, system calls are initiated by the user: calling fork, for example, indirectly invokes the system function sys_fork and traps into kernel mode. IRQs, from the user's point of view, can be triggered "actively" or "passively": calling malloc may cause a page fault, raising an exception that traps into the kernel; or, when a chunk of data we asked to read from disk is ready, the disk raises an IRQ to notify the system that the data has been written to the designated memory.

Cost of a switch

From user mode to kernel mode, everything is in essence interrupt handling: system calls are ultimately implemented via the interrupt machinery, and exceptions and interrupts are handled almost identically. A switch proceeds as follows:

  1. Fetch the kernel stack's ss0 and esp0 from the current process descriptor.
  2. Save the current cs, eip, eflags, ss and esp onto the kernel stack pointed to by ss0/esp0; this completes the switch from the user stack to the kernel stack, while recording the next instruction of the interrupted program.
  3. Load the interrupt handler's cs and eip, obtained earlier from the interrupt vector, into the corresponding registers and start executing the handler; from this point on, kernel-mode code is running.

Returning from kernel mode to user mode restores the saved process state.

As these steps show, a mode switch copies a fair amount of data and needs hardware cooperation, so it is fairly expensive; without mode switches, the cpu just executes instructions in sequence. So mode switches should be minimized! The little benchmark below gives a feel for the cost.
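A rough, hedged sketch (absolute timings depend entirely on the machine and kernel): one write(2) per byte forces a mode switch per byte, while a 64 KB user-space buffer switches once per 65536 bytes:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <time.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    enum { N = 1 << 20, BUF = 1 << 16 }; /* 1 MB total, 64 KB buffer */
    static char buf[BUF];
    int i, fd = open("/dev/null", O_WRONLY);
    memset(buf, 'x', sizeof buf);
    double t0 = now_sec();
    for (i = 0; i < N; i++)
        write(fd, "x", 1);   /* one syscall (mode switch) per byte */
    double t1 = now_sec();
    for (i = 0; i < N; i += BUF)
        write(fd, buf, BUF); /* one syscall per 64 KB */
    double t2 = now_sec();
    printf("1-byte writes: %.3fs, 64KB writes: %.3fs\n", t1 - t0, t2 - t1);
    close(fd);
    return 0;
}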

References

http://blog.codinghorror.com/understanding-user-and-kernel-mode/

http://www.linfo.org/kernel_mode.html

http://www.tldp.org/HOWTO/KernelAnalysis-HOWTO-3.html

http://jakielong.iteye.com/blog/771663

http://os.ibds.kit.edu/downloads/publ_1995_liedtke_ukernel-construction.pdf

http://os.inf.tu-dresden.de/pubs/sosp97/

On mac 10.10, printing always failed with a "Hold for Authentication" (保持以备鉴定) error.

With the driver confirmed correct and the username/password correct, I found a rather odd workaround:

1. In Finder's "Go" menu, enter smb://172.22.2.xx (the print server's address) and mount a volume

2. Configure the printer normally

3. Print, enter the username and password, and it succeeds