Archive for the ‘recommend system’ Category

在网上没有找到合适的开源collaborative filtering的C语言实现,所以决定自己动手写一个。

目前网络上的cf资源请参考前文:Collaborative Filtering Resources在动手之前,试用了SUGGEST库,这是一个库文件,没有开放源代码,但是性能和推荐结果都挺不错的,我没有妄图超越它。

计划写的内容(不一定按照计划来):

窥探SUGGEST库

实现user-based cf算法:使用三元组存储稀疏矩阵

实现user-based cf算法:计算用户间距离

实现user-based cf算法:根据相似用户进行推荐

实现item-based cf算法

根据item相关性,改进cf算法

由于我目前只初步实现了user-based cf算法,所以只能保证前4条。之后,我希望可以把源码开放出来,给同学们做一个参考。也希望更多更好的cf算法能够开源。

Collaborative Filtering Resources

maintained by Jun Wang

Generally, collaborative filtering (CF) is any algorithm that filters information for a user based on a collection of user profiles. Users having similar profiles may share similar interests. For a user, information can be filtered in/out regarding to the behaviors of his or her similar users.

Users profiles can be collected either explicitly or implicitly. One can explicitly ask users to rate what they have used/purchased. Such a profile is filled explicitly by the users ratings. An implicit profile is based on passive observation and contains users historic interaction data.

The most common usage of CF is to make recommendation. That’s why collaborative filtering is strongly correlated to recommender system in literature, although CF is only one of the methods for recommender system.

In this page, I collected some useful online materials for collaborative filtering research.


Content


Research Software


Data Sets

Explicit Rating Data Sets:

Implicit Rating Data Sets:

  • Audioscrobblers Music Play-list Data-sets.The Audioscrobbler dataset collects the play-lists of the users in a one-line community (http://www.audioscrobbler.com/) by using a plug-in in the users’ media players such as Winamp, iTunes, XMMS etc. The plug-ins send the title and artist of every song users play to the Audioscrobbler server, which updates the user’s musical profile with the new songs. In the database, the user’s profile is recorded as a form of co-occurrence pair like {userID,itemID} pair. The pair means a user {userID} has played a\ song {itemID}. The dataset can be obtained at http://www.audioscrobbler.com/data/
  • AOL Web search query: http://www.gregsadetsky.com/aol-data/

Collaborative Filtering Bibliography

1. Pure Collaborative Filtering

Memory-based

Relevance Models

Latent Class Models

Matrix Factorization

Clustering

Transitive Associations

Trust Inference

Perception-based

  • Online ranking/collaborative filtering using the perception algorithm (2003).

2. Combining Content-based and Collaborative Filtering

3. Distributed Collaborative Filtering

4. Other issues


Related Information Retrieval Papers

In general, collaborative filtering is formulated as a self-contained problem, apart from classic approaches for text retrieval, e.g. RSJ models and language models. However, the collaborative filtering problem can be treated as a prediction problem – a prediction of the relevance between user and item (see user-item relevance models). Under this veiw, the instant benefits are gained from the current advances in these text retrieval models. We found the following papers are pretty interesting and are related to the collaborative filtering problem.


Related Machine Learning Papers

zz from:http://ict.ewi.tudelft.nl/~jun/CollaborativeFiltering.html

SCWS是一款简单的开源中文分词系统,其网址为:http://www.ftphp.com/scws/

在代码中添加了一些输出,以便探究其分词的算法:

[yicheng@chengyi cli]$ ./scws -c utf-8 -d dict.utf8.xdb -r rules.utf8.ini -D -i ‘中国人民在中国人民大学上大学’ -t 10
No. WordString               Attr  Weight(times)
————————————————-
[0,0]Ox11(中)
[0,1]Ox83(中国)
[0,2]Ox81(中国人)
[1,1]Ox31(国)
[1,2]Ox81(国人)
[2,2]Ox31(人)
[2,3]Ox81(人民)
[3,3]Ox21(民)
[4,4]Ox1(在)
[5,5]Ox11(中)
[5,6]Ox83(中国)
[5,7]Ox83(中国人)
[5,10]Ox81(中国人民大学)
[6,6]Ox31(国)
[6,7]Ox81(国人)
[7,7]Ox31(人)
[7,8]Ox83(人民)
[7,10]Ox81(人民大学)
[8,8]Ox21(民)
[9,9]Ox31(大)
[9,10]Ox81(大学)
[10,10]Ox21(学)
[11,11]Ox1(上)
[12,12]Ox11(大)
[12,13]Ox81(大学)
[13,13]Ox21(学)
PATH by keyword = 中国人, (weight=8.9964):
中国人 民
PATH by keyword = 中国, (weight=219.7764):
中国 人民
PATH by keyword = 国人, (weight=0.0218):
中 国人 民
PATH by keyword = 人民, (weight=219.7764):
中国 人民
01. 中国人民大学       nt    10.64(1)
02. 中国                   ns    6.26(1)
03. 人民                   n     4.41(1)
04. 大学                   n     4.23(1)
+–[lt-scws(scws-cli/1.1.2)]———-+
| TextLen:   42                  |
| Prepare:   0.0074    (sec)     |
| Segment:   0.0010    (sec)     |
+——————————–+

以下简单记录今天看代码的收获。

typedef struct

{

xdict_t d;

rule_t r;

unsigned char *mblen;

unsigned int mode;

unsigned char *txt;

int zis;

int len;

int off;

int wend;

scws_res_t res0;

scws_res_t res1;

word_t **wmap;

struct scws_zchar *zmap;

}       scws_st, *scws_t;

是很核心的结构体。其中,使用txt保存用户输入的待分词字符串,使用off记录当前待处理的字节位置,res0为返回的分词结果,wmap就是上面输出的二维数组,zmap为wmap中每个词在txt中的index位置,d是字典,r是rule。

1、在scws_get_result函数中,首先按照最简单的规则,对文字进行分段,比如字母、数字、标点、换行符等,都被当做token来分段。

2、 对于每一段,首先初始化wmap二维数组,在它的对角线上存储每个字。

3、 然后依次访问wmap的对角线,向后匹配,在字典中查找,若找到匹配结果,则存储在wmap的行里。比如

[0,0]Ox11(中)
[0,1]Ox83(中国)
[0,2]Ox81(中国人)

就是从对角线的第一个元素,依次匹配第二个、第三个元素的结果。

4、 根据rule,进行规则过滤

5、 对wmap中的元素遍历,根据tf、idf计算weight,找出weight最大的path,作为最优分词结果。

6、 清理内存,返回。

scws_get_result的返回结果是二维的。每一次返回的是当前段的处理结果,是一个链表。该结果的next,是下一段的处理结果。

所以需要

while (res = cur = scws_get_result(s))

{

while (cur != NULL)

{

printf(“Word: %.*s/%s (IDF = %4.2f)\n”,

cur->len, text+cur->off, cur->attr, cur->idf);

cur = cur->next;

}

scws_free_result(res);

printf(“\n———————\n”);

}

我需要招一个人,来帮忙我做事情。今天跟leader请教了一些面试的注意点。

首先,招人来,是帮我的,是由于我的时间和精力不够。所以,他并不一定要是最优秀的,但是必须是比较适合我的,从技术能力、性格、和他愿意做的事情上。

然后,我们的面试,时间不像某些公司那样,会拖的很长,基本就是一锤子搞定,一个小时之内。所以,题目不能太大,但是需要能看出他分析问题和解决问题的思路是否清晰。

再然后,想想我现在让他来是做什么的呢?

SEO方面,希望他能够帮我写一些脚本,分析log,能够找到spider时间、规律等;能够脚本分析baidu、google中,我们页面针对某关键词的排名、快照时间等。所以,这里我希望他是了解一些正则表达式、会某一种脚本语言的(当然,从维护性考虑,最好是php或者perl)。

搜索方面,现在的搜索肯定一下子不会大动。但是希望他能搞明白现有搜索的框架、了解代码结构,找出有哪些可以改进的点,以提高我们搜索的准确度。这要求他了解分词、sphinx、hlseg(或者类似的),能够看懂大量的代码(c、java、php、perl、bash)。

推荐算法方面,这里还是主要我做,但是希望他也能够了解并使用hadoop框架,能够维护我的代码;并且能够和我一起讨论算法。

所以,我的关注点应是:

1、他是否愿意写代码,如果能够喜欢就太好了

2、面对一个新的问题,用怎样的思路去分析和解决

3、怎样快速学习并掌握一个新东西

4、有怎样的代码基础和经验(如果基础很好,经验不重要; 但是如果基础和经验都很好,那就太好了)

5、性格方面,是否勤奋、聪明,刚开始的时候,能够忍耐无聊的工作(我相信之后的工作是很有趣的)

zz from : http://arstechnica.com/science/news/2010/02/recommendation-algorithm-wants-to-show-you-something-new.ars

When it comes to recommendation systems, everybody’s looking to increase accuracy: the Netflix Prize was awarded last July for an algorithm that improved the accuracy of the service’s recommendation algorithm by 10 percent. However, computer scientists are finding a new metric to improve upon: recommendation diversity. In a paper that will be released by PNAS, a group of scientists are pushing the limits of recommendation systems, creating new algorithms that will make more tangential recommendations to users, which can help expand their interests, which will increase the longevity and utility of the recommendation system itself.

Accuracy has long been the most prized measurement in recommending content, like movies, links, or music. However, computer scientists note that this type of system can narrow the field of interest for each user the more it is used. Improved accuracy can result in a strong filtering based on a user’s interests, until the system can only recommend a small subset of all the content it has to offer.

The authors of the paper also note that accurate recommendations are not always useful. For example, suggesting one generic romantic comedy after another (say What Happens in Vegas and Just Married) just because a user rated When Harry Met Sally five stars is not helpful. Systems that base recommendations on correlations between users can miss niche items that a user might like, but would never find on his own. Research indicates that the most interesting recommendations and information originate from “weak ties” in a system, between users that are somewhat similar but disparate enough that they can introduce novelty to each other.

To widen the potential field of user interest, the authors developed a hybrid of two algorithms. One combined an algorithm that based its recommendations on random walks between highly connected users and material; the other mirrored the process of heat diffusion, spreading ratings at a decreasing level of potency as the recommendation had to travel further. The heat diffusion algorithm can be thought of as a system that has users connected in a network with the objects they have interacted with and evaluated, and values are passed among the items in this network to develop ratings.

The head diffusion model uses values of 1 or 0 for the material to be recommended—either a user liked something or he didn’t—and takes an average of the total resources a user had assigned to an object to give the user a value. For example, if a user liked two things and disliked two others, the value assigned to the user would be one-half.

The algorithm then averaged these values for any users connected to an object, and this became the object’s value in the system (for example, if two users were attached to an object and one had a value of one-half and the other had zero, the new value assigned to the object would be one quarter). All of this can be done using a small set of data, meaning the heat diffusion algorithm can make diverse yet relevant recommendations based onsparse data in one pass.

To test the algorithms individually and in hybrid form, scientists used data sets from Netflix, Rate Your Music and del.ici.ous, reducing ratings of various numbers of stars to likes or dislikes (three stars out of five and six out of ten qualified as a “like” in Netflix and RYM, respectively). They removed 10 percent of the selections from the data sets, and then applied the algorithms to test how much of the deleted data they could recover, as well as how many new and relevant selections the algorithms could make.

Combining the heat diffusion approach with the safer and more accurate random walk, the researchers found that they could create a body of recommendations that combined novelty items and safer, more accurate pieces. More importantly, using both allowed for more accurate recommendations than using either alone.

The hybrid took the form of a linear combination of the random walk and the heat diffusion algorithms, and the influence of each could be tuned by adjusting their coefficients to create more novelty or more accuracy as needed. This might allow for a system where a user could adjust the recommendations according to how interested they are in seeing something that may be outside of their normal content. The authors also noted that adding a global ranking algorithm that recommends items based on overall popularity could improve accuracy when little is known about the user.

While the accuracy of recommendations has been the prized focus (literally) in these systems, diversity and novelty are prized measures too (think of all those friends who boast about liking bands or movies before they were popular). The algorithms are still largely experimental, and the authors note that there is a significantly higher computational cost associated with using a hybrid algorithm. Nonetheless, diversity of suggestions seems to be the next horizon in refining recommendation systems.

PNAS, 2010. DOI: 10.1073/pnas.1000488107