Event logs for completed jobs

The company cluster lets you view the list of completed jobs, but under the default configuration, clicking into one shows:

No event logs were found for this application! To enable event logging, set spark.eventLog.enabled to true and spark.eventLog.dir to the directory to which your event logs are written.

After setting spark.eventLog.enabled to true in the Spark configuration and pointing spark.eventLog.dir at a local file path, clicking in now shows:

No event logs found for application chengyi02_test_load_1GB in file:/home/users/chengyi02/spark-resource/test_perf/log//chengyi02_test_load_1gb-1427419993180. Did you specify the correct logging directory?

The reason: a local path on my machine obviously cannot be read by the history server. Still, that directory did now contain event logs, which appear to be JSON.

Changed spark.eventLog.dir again, this time to an HDFS path, and the Spark job then failed. After checking with the admins, it turned out the history server on this Spark cluster was simply not running:

15/03/27 09:50:32 WARN ServletHandler: /stages/
java.lang.NullPointerException
        at org.apache.spark.SparkContext.getAllPools(SparkContext.scala:892)
        at org.apache.spark.ui.jobs.JobProgressPage.render(JobProgressPage.scala:50)
        at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
        at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
        at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:70)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        ...
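With a working history server, the two event-log settings involved can be sketched in spark-defaults.conf like this (the HDFS path below is a placeholder for illustration, not the actual cluster directory):

```properties
# Write event logs so the history server can render finished applications.
spark.eventLog.enabled   true
# Placeholder path -- must be a shared directory the history server can read.
spark.eventLog.dir       hdfs:///path/to/spark-event-logs
```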

"too many values to unpack" caused by partitionBy

http://stackoverflow.com/questions/7053551/python-valueerror-too-many-values-to-unpack
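The underlying Python error is easy to reproduce without Spark: partitionBy (like the other *ByKey operations) expects an RDD of 2-tuples, and a record with more fields fails during tuple unpacking. A minimal sketch:

```python
# A record with three fields is not a (key, value) pair; unpacking it
# into two names raises the same ValueError seen in the executor logs.
record = ("key", 1, 2)

try:
    k, v = record
except ValueError as e:
    print(e)  # the "too many values to unpack" message

# Fix: reshape each record into an explicit (key, value) pair first,
# e.g. rdd.map(lambda r: (r[0], r[1:])) before partitionBy.
k, v = record[0], record[1:]
print(k, v)
```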

Tuning HDFS read/write performance

First test: 60 GB of data in 1108 files, sized from ~20 MB to 100+ MB. With max cores set to 160, 1108 tasks were launched, apparently on only 5 servers. Reading took 1.3 min; I ran the write twice, once taking 1.75 min and once 8 min (hung on a single task).

Second test: raised max cores to 1000 and set spark.speculation to true so that slow tasks get re-launched. Read time improved to 29 s and write time to 41 s, a clear win. Since the file count was unchanged there were still 1108 tasks, but now spread over 32 servers.

Third test: max cores 1000, spark.speculation true, but with the files consolidated to ~165 MB each: 359 files, 60 GB total. Reading took 40 s, writing 55 s. With fewer tasks each one carries more data, so the overall time actually rose slightly compared with the second test.

Fourth test: max cores back to 160 and spark.speculation left unset, with the files again at ~165 MB each (359 files, 60 GB). Reading took 51 s, writing 442 s (hung on 2 tasks). So setting spark.speculation really matters!
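The speculation setup used in tests 2-4 can be sketched in spark-defaults.conf as follows (the interval/multiplier values shown are Spark 1.x defaults, listed for illustration rather than measured tuning):

```properties
# Re-launch copies of straggler tasks on other nodes.
spark.speculation             true
# How often (ms) to check for slow tasks, and how much slower than the
# median a task must be before a speculative copy is launched.
spark.speculation.interval    100
spark.speculation.multiplier  1.5
```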

Fifth test: 3.TB of data, 15846 files sized from 27 MB to 260 MB, max cores set to 10000. It started out on 32 servers running 32 concurrent tasks each, later growing to 50+ servers for an overall parallelism of 1664. Then it drove Spark into full GC.

java.io.IOException: org.apache.hadoop.hdfs.FMSClient$RetryableIOException: Could not obtain block: blk_31525197851421116_494327281

(Monitoring screenshot omitted.)

Sixth test: the same 3.TB of data, 15846 files from 27 MB to 260 MB, max cores 10000, but with the Spark version changed to 1.2. It still failed:

Error type 1:

15/03/27 14:56:56 ERROR DAGScheduler: Failed to update accumulators for ResultTask(0, 4815)
java.net.SocketException: Broken pipe

Error type 2:

java.io.IOException: org.apache.hadoop.hdfs.FMSClient$RetryableIOException: Could not obtain block: blk_31525197851421116_494327281
at org.apache.hadoop.hdfs.FMSClient$DFSInputStream.read(FMSClient.java:2563)
...

The failures in the fifth and sixth tests were probably both caused by bad disk sectors on the HDFS servers.

Seventh test: Spark 1.2, 1.3 TB of data in just 114 files, from a few hundred bytes up to 57 GB, with large files dominating. Max cores 10000, spark.speculation true. 5353 tasks were launched, and reading took 5.1 min!

Eighth test: same as the seventh, but on Spark 1.1. Reading took 3.6 min.

Reading an HDFS path without permission

TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

Fix: in spark/conf/hadoop-site.xml on the machine running the application, set the UGI configuration to the correct username and password.
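A sketch of that hadoop-site.xml change, assuming the classic hadoop.job.ugi property; the exact property name and credential format may differ on an internally patched HDFS, and the values below are placeholders:

```xml
<property>
  <name>hadoop.job.ugi</name>
  <!-- placeholder credentials; format assumed to be username,password here -->
  <value>your_user,your_password</value>
</property>
```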

broadcast an RDD in PySpark

>>> data = sc.textFile('/app/ecom/aries/sf_public/chengyi02/uclogin/20141120/cq02-dr-uclogin206.cq02/*')
>>> sc.broadcast(data)

This fails with: py4j.Py4JException: Method __getnewargs__([]) does not exist

Changing it to sc.broadcast(data.collectAsMap()) makes the error go away.

Explanation: what you broadcast must be a plain local value, not an RDD. collectAsMap() collects the textFile RDD back to the driver as a dict of key/value pairs, and it is that local dict that gets shipped to the executors.

Reading an HDFS SequenceFile

Job aborted due to stage failure: Task 8 in stage 0.0 failed 4 times, most recent failure: Lost task 8.3 in stage 0.0 (TID 47, nmg01-spark-a0033.nmg01.baidu.com): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 64
        com.esotericsoftware.kryo.io.Output.require(Output.java:138)
        com.esotericsoftware.kryo.io.Output.writeString_slow(Output.java:420)
        com.esotericsoftware.kryo.io.Output.writeString(Output.java:326)
        com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:274)
        com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:262)
        com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
        org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:156)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
        java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        java.lang.Thread.run(Thread.java:662)
Driver stacktrace:
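A Kryo "Buffer overflow. Available: 0, required: ..." during task-result serialization is typically addressed by enlarging the Kryo buffers. A hedged sketch for the Spark 1.x era (the sizes are illustrative, not measured requirements; later 1.x releases renamed these keys to spark.kryoserializer.buffer and spark.kryoserializer.buffer.max):

```properties
# Initial and maximum per-task Kryo serialization buffer, in MB.
spark.kryoserializer.buffer.mb       8
spark.kryoserializer.buffer.max.mb   512
```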

Abnormal job shutdown

SparkDeploySchedulerBackend: Asked to remove non-existent executor

Fix: delete the leftover scratch files /tmp/spark-* and /tmp/fetchFileTemp* on the affected machines.

Job fails to launch

Launching a job with spark-submit failed; the errors at the bottom of the log were:

15/03/30 09:51:17 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
15/03/30 09:51:17 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: Master removed our application: FAILED

The cause: the job asked for far too many cores. Max cores was set to 10000, while the cluster as a whole reports Cores: 2272 Total, 472 Used. If max cores exceeds the cluster total of 2272, the application gets killed! If it merely exceeds the number of unused cores, it grabs whatever is available and the remaining tasks queue up.
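In other words, spark.cores.max must stay below the cluster's total core count. A sketch (the 2000 is just an example value under the 2272-core total mentioned above):

```properties
# Keep the requested core cap below the cluster total (2272 here),
# otherwise the master removes the application as FAILED.
spark.cores.max   2000
```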

Further up in the log:

15/03/30 09:51:17 INFO AppClient$ClientActor: Executor updated: app-20150330095112-25033/62 is now EXITED (Command exited with code 1)

Using the task number, find its corresponding server in the log:

15/03/30 09:51:16 INFO AppClient$ClientActor: Executor added: app-20150330095112-25033/62 on worker-20141125192713-nmg01-spark-a0062.nmg01.xxx.com-11543 (nmg01-spark-a0062.nmg01.xxx.com:11543) with 32 cores

On the Spark cluster's monitoring page, find the corresponding worker. Its Finished Executors list shows a large number of failed tasks; locating mine and checking its stdout log shows:

Error occurred during initialization of VM
Could not reserve enough space for object heap

So these failures are likely OOM on some of the nodes, but in most cases Spark can work around them by re-running or retrying tasks.
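"Could not reserve enough space for object heap" means the executor JVM requested more heap than the worker node had free. One hedged mitigation is to lower the per-executor heap request (the 4g below is illustrative, not a recommendation for this cluster):

```properties
# Request less heap per executor than the worker's free physical memory,
# so the JVM can actually reserve it at startup.
spark.executor.memory   4g
```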

Out of Memory

While calling takeSample on a large list and collecting the result back, the following error appeared. The two hosts highlighted in red in the original log are a worker and the machine running the application, respectively.

15/03/30 17:35:15 WARN DefaultChannelPipeline: An exception was thrown by a user handler while handling an exception event ([id: 0x22dfe421, /10.75.65.12:12237 => /10.48.23.31:30001] EXCEPTION: java.lang.OutOfMemoryError: Java heap space)
java.lang.OutOfMemoryError: Java heap space
at java.lang.Object.clone(Native Method)
at akka.util.CompactByteString$.apply(ByteString.scala:410)
at akka.util.ByteString$.apply(ByteString.scala:22)
at akka.remote.transport.netty.TcpHandlers$class.onMessage(TcpSupport.scala:45)
at akka.remote.transport.netty.TcpServerHandler.onMessage(TcpSupport.scala:57)
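Both ends of the highlighted connection hold the sampled data: the worker serializing it and the driver receiving the collect. Beyond sampling less, one mitigation is raising the driver's heap; a sketch (the 8g is illustrative only):

```properties
# Give the driver enough heap to hold collected results.
spark.driver.memory   8g
```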

Broadcasting large data

The file is about 1 GB; after textFile().collect(), it occupies xxxx of memory.

Configuration used:

 

Problem 1:

The job submitted via spark-submit exited with no error message. In the history UI list the application shows as finished, but clicking in reveals jobs that are still active.

Going to the executor UI list page and opening this app's stderr shows:

15/04/07 13:07:47 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@nmg01-taihang-d10538.nmg01.baidu.com:54765] -> [akka.tcp://sparkDriver@cq01-rdqa-dev006.cq01.baidu.com:16313] disassociated! Shutting down.
15/04/07 13:07:47 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@cq01-rdqa-dev006.cq01.baidu.com:16313] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

Following http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-3106-fixed-td16264.html#a16316 , setting cf.set('spark.akka.heartbeat.interval', 10000) did not help.

Issues on the driver machine itself:

$dmesg

lowest memcgroup 19

[chengyi02@cq01-rdqa-dev006.cq01.baidu.com test_perf]$ strace -p 19970
Process 19970 attached – interrupt to quit
futex(0x40e4e9f0, FUTEX_WAIT, 19974, NULL

Running Python unit tests

From spark/python/run-tests you can see how tests are invoked:

SPARK_TESTING=1  /path/to/spark-*-client/bin/pyspark tests/schedule_frame_test.py

 
