Archive for March, 2010

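Reference: http://dev.mysql.com/doc/refman/5.0/en/kill.html

When mysqld is using too much CPU or other resources, you can check which SQL statements are currently running:

mysql> show processlist;

If you find a statement that should not be running, you can stop it: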

mysql>kill thread_id;
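
When the process list is long, it can help to filter it before deciding what to kill. The following is only a sketch: the sample rows, the 60-second threshold and the function name long_running_kills are assumptions made for illustration. On a real server the input would come from something like `mysql --batch -e 'SHOW FULL PROCESSLIST'`, whose tab-separated columns are Id, User, Host, db, Command, Time, State and Info.

```shell
#!/bin/bash
# Print a KILL statement for every running query older than $1 seconds.
# Expects tab-separated SHOW FULL PROCESSLIST output (with header) on stdin.
long_running_kills() {
    awk -F'\t' -v limit="$1" \
        'NR > 1 && $5 == "Query" && $6 + 0 > limit { print "KILL " $1 ";" }'
}

# Made-up sample rows standing in for real server output.
sample=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    Id User Host db Command Time State Info \
    11 ms localhost music Sleep 300 '' NULL \
    12 ms localhost music Query 125 'Sending data' 'SELECT SLEEP(200)' \
    13 ms localhost music Query 2 NULL 'SHOW FULL PROCESSLIST')

kills=$(printf '%s\n' "$sample" | long_running_kills 60)
echo "$kills"
```

The script only prints the statements; review them before feeding them back to the server.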

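Today I ran into something odd. The ms user's home directory has a .bash_profile in which I define some PATH entries, but after running sudo su ms from the flykobe user, echo $PATH showed that .bash_profile had not been loaded.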

flykobe@138 v1 $ echo $PATH
/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/x86_64-pc-linux-gnu/gcc-bin/4.1.2:/usr/local/apache/bin:/usr/local/mysql/bin:/home/yicheng/tools/:/usr/local/mysql/bin/:/home/yicheng/music_seo/script/:/home/yicheng/music_seo/script/smalltools/
ms@tj1cantispam002 /home/yicheng $ echo $PATH
/bin:/usr/bin
This PATH was clearly not inherited from the flykobe user either.
After some searching, I found the subtle differences between sudo and su:

su = switch user
su - username = switch to that user with a login shell: the environment is reset and the user's login files (such as .bash_profile) are read
sudo <command> = execute this command as root (requires that you are allowed to use sudo, e.g. set as an admin in OS X)
sudo su = execute the switch user command as root; this is a little weird, since you don't need to be root to switch users (without root, su simply asks for the target user's password).

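In other words, sudo su ms only switches the user. It does not start a login shell, so the ms user's login configuration files are not loaded.

Running sudo su - ms, on the other hand, is equivalent to really logging in again!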
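
The same difference can be reproduced without sudo. The sketch below is an approximation, not the exact su code path: it uses bash -c for a non-login shell and bash -l for a login shell, and points HOME at a throwaway directory. The variable DEMO_MARKER and the temporary directory are made up for the demonstration.

```shell
#!/bin/bash
# Throwaway home directory with a .bash_profile that sets a marker variable.
demo_home=$(mktemp -d)
echo 'export DEMO_MARKER=loaded_from_bash_profile' > "$demo_home/.bash_profile"
unset DEMO_MARKER

# Non-login shell (like "su user"): .bash_profile is not read.
non_login=$(HOME="$demo_home" bash -c 'echo "${DEMO_MARKER:-unset}"')

# Login shell (like "su - user"): .bash_profile is read.
# tail -n 1 guards against anything /etc/profile might print.
login=$(HOME="$demo_home" bash -lc 'echo "${DEMO_MARKER:-unset}"' 2>/dev/null | tail -n 1)

echo "non-login shell: $non_login"
echo "login shell:     $login"
rm -rf "$demo_home"
```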

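I have recently been working on a user recommendation algorithm project that needs to analyze a large volume of logs and run calculations on them. Time was tight and I could not afford to study a mature mapreduce project, so I wrote some very simple code myself in php and bash.

The drawback is that the job takes a long time to run every day, so I pulled out the calculate part and turned it into a crude form of distributed computing.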

######## Run the calculation on each independent compute server #####################
{ scp $ANALYDATAPATH/idf.2 ms@10.60.0.105:$ANALYDATAPATH; ssh ms@10.60.0.105 "cd $ANALYPATH; ./cal.php"; scp $ANALYDATAPATH/load.2 ms@10.60.1.138:$ANALYDATAPATH; } &
{ cd $ANALYPATH; ./cal.php; cd -; } &
wait
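
A bare wait does not report whether the background jobs succeeded. One possible refinement, sketched here with hypothetical stand-in functions (remote_job and local_job are not part of the original script), is to remember each PID and wait for it individually so the exit statuses can be checked.

```shell
#!/bin/bash
# Hypothetical stand-ins for the remote and local halves of the job.
remote_job() { sleep 0.2; return 0; }
local_job()  { sleep 0.1; return 3; }   # pretend the local run failed

remote_job & remote_pid=$!
local_job &  local_pid=$!

# wait <pid> returns the exit status of that particular background job.
remote_status=0; wait "$remote_pid" || remote_status=$?
local_status=0;  wait "$local_pid"  || local_status=$?

echo "remote exit status: $remote_status"
echo "local exit status:  $local_status"
```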

This is mainly based on: http://www.cnitblog.com/sysop/archive/2008/11/03/50974.aspx

The code there that limits the number of threads is rather interesting, so I quote it below:

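#!/bin/bash
# 2006-7-12, by wwy
#-----------------------------------------------------------------------------------
# This example shows a technique for simulating multithreading with the wait and read commands.
# It is typically used for multi-host checks such as ssh logins or ping, where a single
# process is slow but uses little CPU.
# It also shows how to limit the number of threads.
#-----------------------------------------------------------------------------------

function a_sub {
    # Define a function that acts as one thread (child process)
    sleep 3 # the thread's job is to sleep for 3s
}

tmp_fifofile=/tmp/$$.fifo
mkfifo $tmp_fifofile      # create a FIFO file
exec 6<>$tmp_fifofile     # point fd 6 at the FIFO
rm $tmp_fifofile

thread=15 # number of threads
for ((i=0;i<$thread;i++));do
    echo
done >&6 # in effect, this puts $thread newlines into fd 6

for ((i=0;i<50;i++));do # 50 iterations; think of them as 50 hosts, or anything else

    read -u6
    # Each read -u6 removes one newline from fd 6 and then continues.
    # When fd 6 has no newline left, it blocks here, which limits the number of threads.

    { # the child process starts here and is put in the background
        a_sub && { # this is where the child's result can be checked
            echo a_sub is finished
        } || {
            echo sub error
        }
        echo >&6 # when the process ends, put a newline back into fd 6 to replace the one read -u6 removed
    } &

done

wait # wait for all background child processes to finish
exec 6>&- # close fd 6

exit 0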

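Each task sleeps for 3s, the thread count is 15, and the loop runs 50 times, so the whole script takes about 12 seconds.

That is:
15 x 3 = 45 tasks make up 3 full batches, so 3 x 3s = 9s
the remaining 50 - 45 = 5 tasks (fewer than 15) fit in one more batch, so 1 x 3s = 3s
in total 9s + 3s = 12s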

$ time ./multithread.sh >/dev/null

real        0m12.025s
user        0m0.020s
sys         0m0.064s

Without the multithreading technique, the run time would be 50 x 3s = 150s.
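
The same estimate can be written as a ceiling division. This is just the arithmetic above in shell form; the variable names are chosen for illustration, and it assumes every task takes exactly the same time.

```shell
#!/bin/bash
tasks=50
threads=15
seconds_per_task=3

# Ceiling division: number of batches needed when at most $threads run at once.
batches=$(( (tasks + threads - 1) / threads ))
parallel_time=$(( batches * seconds_per_task ))
serial_time=$(( tasks * seconds_per_task ))

echo "batches: $batches"
echo "parallel: ${parallel_time}s, serial: ${serial_time}s"
```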

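The command used in this program

mkfifo tmpfile

and the Linux command

mknod tmpfile p

have the same effect. The difference is that mkfifo is specified by POSIX, so it is the recommended one. The command creates a first-in, first-out pipe file (a named pipe), which the script then attaches to file descriptor 6. A pipe file is one way for processes to communicate with each other. Note that the following line is important: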

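exec 6<>$tmp_fifofile # point fd 6 at the FIFO

Without this line, writing to $tmp_fifofile or to &6 would block until some read consumes the data from the pipe. Once the line has been executed, fd 6 is open for both reading and writing, so the script can keep writing to the FIFO while it runs without blocking (as long as the pipe buffer is not full), and the data is kept until a read picks it up.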
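
A small experiment makes this behaviour visible. The sketch below assumes Linux, where opening a FIFO read-write is allowed and does not block (POSIX leaves that case undefined); the token strings and the temporary path are made up.

```shell
#!/bin/bash
# Create a FIFO at a fresh temporary path and attach fd 6 to it read-write.
fifo=$(mktemp -u)
mkfifo "$fifo"
exec 6<>"$fifo"
rm "$fifo"          # the open descriptor keeps the pipe alive

# Both writes return immediately even though nothing is reading yet.
echo "token-1" >&6
echo "token-2" >&6

# The data was kept in the pipe and comes back in first-in, first-out order.
read -u6 first
read -u6 second
exec 6>&-           # close fd 6

echo "read back: $first, $second"
```

A third read -u6 at this point would block, which is exactly what the thread-limiting loop relies on.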

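Shell scripts very often need public-key authentication for ssh, scp and similar operations. Only then can they run entirely without human intervention.

The ssh-keygen command provides this:

0. Scenario: server 138 needs to control servers 135 and 105 using public-key authentication.

1. First, generate the private key and the public key on 138: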

ms@tj1cantispam002 ~ $ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/ms/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/ms/.ssh/id_rsa.
Your public key has been saved in /home/ms/.ssh/id_rsa.pub.
The key fingerprint is:
b6:29:ea:5b:1a:72:97:4a:71:d7:6a:f8:a3:88:15:03 ms@tj1cantispam002

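As shown, the newly generated files are id_rsa and id_rsa.pub. The former is the private key, which stays on 138; the content of the latter is to be appended to the authorized_keys file on 105 and 135.

2. Install the public key

Copy id_rsa.pub to 105 and 135 with scp and append it to .ssh/authorized_keys in the home directory of the target user (ms here):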

cat /path/to/138/id_rsa.pub >> ~/.ssh/authorized_keys
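
Appending with cat works, but running it twice duplicates the key and it does not set permissions. A slightly more careful variant might look like the sketch below. The function name install_key, the placeholder key text and the throwaway directory are assumptions for illustration; where it is available, ssh-copy-id does much the same job.

```shell
#!/bin/bash
# Append a public key to <home>/.ssh/authorized_keys unless it is already there,
# and set the restrictive permissions sshd usually expects.
install_key() {
    local pubkey_file=$1 home_dir=$2
    local auth="$home_dir/.ssh/authorized_keys"
    mkdir -p "$home_dir/.ssh"
    chmod 700 "$home_dir/.ssh"
    touch "$auth"
    chmod 600 "$auth"
    # -x whole line, -F fixed string, -f read the pattern from the key file
    grep -qxF -f "$pubkey_file" "$auth" || cat "$pubkey_file" >> "$auth"
}

# Demo against a throwaway directory with a placeholder (not real) key.
demo_home=$(mktemp -d)
echo 'ssh-rsa PLACEHOLDER-KEY-DATA ms@example' > "$demo_home/id_rsa.pub"
install_key "$demo_home/id_rsa.pub" "$demo_home"
install_key "$demo_home/id_rsa.pub" "$demo_home"   # second run adds nothing

key_lines=$(grep -c '' "$demo_home/.ssh/authorized_keys")
echo "lines in authorized_keys: $key_lines"
rm -rf "$demo_home"
```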

3. Log in with the public key, without a password

From 138 you can now log in to 105 and 135 directly:

ssh ms@10.60.0.105

ssh ms@10.60.1.135

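Note that the public key must be installed for the account you log in as on the remote machine. The account names do not have to match across machines, but using the same one keeps things simple; here I use the "ms" account everywhere.

Also, a known_hosts file is created automatically in that account's .ssh directory. It records the hosts that ssh and scp have connected to.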

————————————–

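If passwordless login still does not work after following the steps above, consider the following:


1. If anyone other than you has write permission on the authorized_keys file, SSH will refuse to use it.
2. Running ssh with the -v option for debugging showed that the public key had been accepted, yet the login still fell back to password authentication. I was baffled. Searching online, I found that many people had hit this problem and that there were many suggested fixes, but none of the ones I tried worked.
  • Method 1: change the permissions of the .ssh directory to 700      no effect
  • Method 2: regenerate the keys                                      no effect
  • Method 3: change options in sshd_config                            no effect
  • Method 4: copy the ssh-related files, with their original
     permissions, from a machine where login works                     no effect
  • Finally I suspected that something had been corrupted during copying, but the md5 sums of all the relevant files matched, which left me speechless.
    By chance I then noticed that this machine ran samba with the root directory exported as a samba share, and that, probably for convenience, the permissions of the root directory had been changed to 777. I used to run into permission problems with samba too and solved them in the same "brute-force" way, although on a small subdirectory rather than on /root itself. It occurred to me that Linux applies fairly strict security rules to the /root directory, so this might be the cause. I changed /root back to 650 and the test passed. What a relief!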
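
Both failures above come down to paths that are writable by someone other than the owner, which sshd's StrictModes check rejects. The sketch below looks for that condition; the function names is_loose and check_ssh_perms and the throwaway directory are made up, and it only inspects the permission string printed by ls, not file ownership.

```shell
#!/bin/bash
# Succeeds when the path is writable by group or by others.
is_loose() {
    local perms
    perms=$(ls -ld "$1" | cut -c1-10)      # e.g. drwxrwxrwx
    [ "${perms:5:1}" = "w" ] || [ "${perms:8:1}" = "w" ]
}

# Report the paths that are likely to make sshd ignore authorized_keys.
check_ssh_perms() {
    local home_dir=$1 path
    for path in "$home_dir" "$home_dir/.ssh" "$home_dir/.ssh/authorized_keys"; do
        if [ -e "$path" ] && is_loose "$path"; then
            echo "too permissive: $path"
        fi
    done
}

# Demo on a throwaway directory that imitates the 777 home directory case.
demo_home=$(mktemp -d)
mkdir "$demo_home/.ssh"
touch "$demo_home/.ssh/authorized_keys"
chmod 777 "$demo_home"
chmod 700 "$demo_home/.ssh"
chmod 600 "$demo_home/.ssh/authorized_keys"

before=$(check_ssh_perms "$demo_home")
chmod 755 "$demo_home"
after=$(check_ssh_perms "$demo_home")

echo "before: ${before:-ok}"
echo "after:  ${after:-ok}"
rm -rf "$demo_home"
```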

Reposted from: http://arstechnica.com/science/news/2010/02/recommendation-algorithm-wants-to-show-you-something-new.ars

When it comes to recommendation systems, everybody’s looking to increase accuracy: the Netflix Prize was awarded last July for an algorithm that improved the accuracy of the service’s recommendation algorithm by 10 percent. However, computer scientists are finding a new metric to improve upon: recommendation diversity. In a paper that will be released by PNAS, a group of scientists are pushing the limits of recommendation systems, creating new algorithms that will make more tangential recommendations to users, which can help expand their interests, which will increase the longevity and utility of the recommendation system itself.

Accuracy has long been the most prized measurement in recommending content, like movies, links, or music. However, computer scientists note that this type of system can narrow the field of interest for each user the more it is used. Improved accuracy can result in a strong filtering based on a user’s interests, until the system can only recommend a small subset of all the content it has to offer.

The authors of the paper also note that accurate recommendations are not always useful. For example, suggesting one generic romantic comedy after another (say What Happens in Vegas and Just Married) just because a user rated When Harry Met Sally five stars is not helpful. Systems that base recommendations on correlations between users can miss niche items that a user might like, but would never find on his own. Research indicates that the most interesting recommendations and information originate from “weak ties” in a system, between users that are somewhat similar but disparate enough that they can introduce novelty to each other.

To widen the potential field of user interest, the authors developed a hybrid of two algorithms. One based its recommendations on random walks between highly connected users and material; the other mirrored the process of heat diffusion, spreading ratings at a decreasing level of potency the further the recommendation had to travel. The heat diffusion algorithm can be thought of as a system that has users connected in a network with the objects they have interacted with and evaluated, and values are passed among the items in this network to develop ratings.

The heat diffusion model uses values of 1 or 0 for the material to be recommended (either a user liked something or he didn't) and takes an average of the total resources a user had assigned to an object to give the user a value. For example, if a user liked two things and disliked two others, the value assigned to the user would be one-half.

The algorithm then averaged these values for any users connected to an object, and this became the object's value in the system (for example, if two users were attached to an object and one had a value of one-half and the other had zero, the new value assigned to the object would be one quarter). All of this can be done using a small set of data, meaning the heat diffusion algorithm can make diverse yet relevant recommendations based on sparse data in one pass.
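
The two averaging steps in that example can be checked with a few lines of shell. This only reproduces the article's toy numbers; it is not the paper's full algorithm, and the helper name mean and the second user's ratings are assumptions.

```shell
#!/bin/bash
# Average of the whitespace-separated numbers on stdin, two decimal places.
mean() {
    LC_ALL=C awk '{ for (i = 1; i <= NF; i++) sum += $i; n += NF }
                  END { printf "%.2f\n", sum / n }'
}

# Step 1: a user's value is the average of their 1/0 (like/dislike) ratings.
user_a=$(echo "1 1 0 0" | mean)    # liked two things, disliked two others
user_b=$(echo "0 0" | mean)        # assumed: a user who liked nothing

# Step 2: an object's value is the average over the users attached to it.
object=$(echo "$user_a $user_b" | mean)

echo "user A: $user_a, user B: $user_b, object: $object"
```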

To test the algorithms individually and in hybrid form, scientists used data sets from Netflix, Rate Your Music and del.icio.us, reducing ratings of various numbers of stars to likes or dislikes (three stars out of five and six out of ten qualified as a "like" in Netflix and RYM, respectively). They removed 10 percent of the selections from the data sets, and then applied the algorithms to test how much of the deleted data they could recover, as well as how many new and relevant selections the algorithms could make.

Combining the heat diffusion approach with the safer and more accurate random walk, the researchers found that they could create a body of recommendations that combined novelty items and safer, more accurate pieces. More importantly, using both allowed for more accurate recommendations than using either alone.

The hybrid took the form of a linear combination of the random walk and the heat diffusion algorithms, and the influence of each could be tuned by adjusting their coefficients to create more novelty or more accuracy as needed. This might allow for a system where a user could adjust the recommendations according to how interested they are in seeing something that may be outside of their normal content. The authors also noted that adding a global ranking algorithm that recommends items based on overall popularity could improve accuracy when little is known about the user.
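
The article describes the hybrid only as a tunable linear combination. The sketch below assumes the simplest such form, score = lambda * heat + (1 - lambda) * walk, which may differ from the paper's actual formulation; the item names and scores are made up.

```shell
#!/bin/bash
# Made-up scores: item, heat-diffusion score, random-walk score.
scores='niche_film 0.9 0.4
blockbuster 0.2 0.8'

# Print the item with the highest blended score for a given lambda in [0, 1].
top_item() {
    printf '%s\n' "$scores" | LC_ALL=C awk -v lambda="$1" '
        { s = lambda * $2 + (1 - lambda) * $3
          if (NR == 1 || s > best) { best = s; name = $1 } }
        END { print name }'
}

novel=$(top_item 1)    # heat diffusion only: favours novelty
safe=$(top_item 0)     # random walk only: favours the safer pick

echo "lambda=1 recommends: $novel"
echo "lambda=0 recommends: $safe"
```

Values of lambda between 0 and 1 trade one behaviour against the other, which is the kind of user-adjustable dial the article mentions.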

While the accuracy of recommendations has been the prized focus (literally) in these systems, diversity and novelty are prized measures too (think of all those friends who boast about liking bands or movies before they were popular). The algorithms are still largely experimental, and the authors note that there is a significantly higher computational cost associated with using a hybrid algorithm. Nonetheless, diversity of suggestions seems to be the next horizon in refining recommendation systems.

PNAS, 2010. DOI: 10.1073/pnas.1000488107