SCSMiner

Introduction

With the advent of social coding sites, software development has entered a new era of collaborative work. Social coding sites (e.g., GitHub) integrate social networking and distributed version control in a unified platform to facilitate collaborative development across the world. One unique characteristic of such sites is that the past development experience recorded for each developer conveys implicit metrics of that developer's programming capability and expertise, which can be applied in many areas, such as software developer recruitment for IT corporations. Motivated by this intuition, we aim to develop a framework that effectively locates developers with the right coding skills. To achieve this goal, we devise a generative probabilistic expert ranking model, incorporate consistency among projects as graph regularization to enhance the expert ranking, and interpret the model from the perspective of relevance propagation. For evaluation, StackOverflow is leveraged to supply the ground truth of expertise. Finally, we implement and demonstrate a prototype system, SCSMiner, which provides an expert search service based on a real-world dataset crawled from GitHub.

Sources

As described above, our experiments are conducted on the GitHub dataset. However, the most significant difficulty is the lack of ground truth on whether a retrieved developer is an expert or not. One simple approach is to select some queries and judge the relevance of the retrieved results manually. However, this approach does not scale and may be affected by many human factors. We therefore use StackOverflow as an indirect ground truth for evaluation.

Dataset1: Intersection of GitHub and StackOverflow download

To intersect data from GitHub and StackOverflow, we adopt a conservative approach based on matching email addresses. In the GitHub dataset email addresses are present, while in the StackOverflow dataset email addresses are obscured but their MD5 hashes are available. Therefore, we merge a GitHub user and a StackOverflow user if their MD5 email hashes are identical. Furthermore, to simplify computation, we only consider users whose reputation in StackOverflow is greater than 5 and whose number of followers in GitHub is greater than 10. Finally, we obtain 16,567 users and 458,639 related projects.
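The matching step above can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: the StackOverflow field names `email_hash` and `reputation`, the helper names, and the lower-casing of addresses before hashing are all assumptions. The GitHub fields `id`, `email` and `followers` follow the Users table below.

```python
import hashlib


def md5_email(email):
    """Hex MD5 digest of an email address.

    Stripping and lower-casing are assumptions; the source does not say
    how addresses were normalised before hashing.
    """
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()


def intersect_users(github_users, so_users, min_followers=10, min_reputation=5):
    """Pair GitHub and StackOverflow users whose MD5 email hashes match.

    `email_hash` and `reputation` are hypothetical StackOverflow field names.
    """
    so_by_hash = {u["email_hash"]: u for u in so_users}
    merged = []
    for gh in github_users:
        email = gh.get("email")
        if not email:
            continue  # no address, so nothing to hash
        so = so_by_hash.get(md5_email(email))
        if so is None:
            continue  # no StackOverflow account with the same hash
        # The follower list may repeat ids, so count distinct followers.
        n_followers = len(set(gh.get("followers", [])))
        if n_followers > min_followers and so["reputation"] > min_reputation:
            merged.append((gh["id"], so["id"]))
    return merged
```

Both thresholds are strict inequalities, mirroring "greater than 5" and "greater than 10" in the text.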

Users

Property Example Description
idx 1 ordinal position of the item in our dataset
_id 3502408 id in GitHub
id 3502408 id in GitHub
login jermsull login username
followers [40710,1466314,5891,40710,1466314,5891] id set of followers
following [40710,184706,991018,1123350,1466314,40710] id set of users the user is following
repos [2325298, 2319498, 2325298, 2319498] id set of repos the user contributes to
location Hangzhou, China
foundation Linux Foundation
organizations [4604446, 4604446] id set of organizations the user belongs to
hireable true whether the user is hireable; can be used for recruitment
name Yao Wan
site_admin false
created_at 2011-09-03T15:26:22Z
updated_at 2015-06-11T00:46:13Z
public_repos 2 public repos count
public_gists 0 public gists count
starred [44658187,10202180,9393759,44658187,10202180,9393759] id set of starred repos
bio 'My name is Yao Wan, a PhD candidate from Zhejiang University.'
email wanyao@zju.edu.cn
blog wanyao.me
company Google

Sample: { "_id" : 3502408, "public_repos" : 2, "repos" : [ 40262919, 37088056, 40262919, 37088056 ], "site_admin" : false, "updated_at" : "2015-12-08T15:08:27Z", "hireable" : null, "id" : 3502408, "blog" : "http://torchlite.com", "followers" : [ 40710, 1466314, 5891, 40710, 1466314, 5891 ], "location" : "Indianapolis, Indiana", "type" : "User", "email" : "sullivan.jeremy@gmail.com", "bio" : null, "company" : "Torchlite Marketing", "login" : "jermsull", "organizations" : [ ], "public_gists" : 0, "name" : "Jeremy Sullivan", "idx" : 1, "created_at" : "2013-02-07T14:57:52Z", "following" : [ 40710, 184706, 991018, 1123350, 1466314, 40710, 184706, 991018, 1123350, 1466314 ], "starred" : [ 44658187, 10202180, 9393759, 44658187, 10202180, 9393759 ] }

Repos

Property Example Description
idx 1
id 53
_id 53
has_wiki true
mirror_url null
contributors [[27,269],[12951,105],[303,66],[16061,13],[8596,4],[2459,4],[281,2]] contributors and the corresponding contribution
subscribers_count 1
private false
full_name anotherjesse/taboo
owner 27
size 781
network_count 11
languages {"JavaScript" : 82130,"Ruby" : 1735} languages the repo contains
watchers_count 21
forks 11
homepage http://overstimulate.com/projects/taboo
fork false
has_downloads true
has_pages false
default_branch master
language JavaScript main language of the repo
subscribers [11713940]
stargazers_count 21
open_issues_count 1
watchers 21
name taboo
forks_count 11
stargazers [27,281,303,604,1423,7853,8596,12951,16061,21679,33067]
open_issues 1
created_at 2008-01-15T08:13:02Z
pushed_at 2010-01-21T07:08:09Z
updated_at 2015-11-12T15:38:16Z
description The solution for tabitus of the browser description of the repo
readme_name README
readme KiBCZWZvcmUgcmVsZWFzZSAwLjIwIFswLzhdcwo=\n...... base64-encoded content of the readme

Sample: { "_id" : 53, "has_wiki" : true, "mirror_url" : null, "contributors" : [ [ 27, 269 ], [ 12951, 105 ], [ 303, 66 ], [ 16061, 13 ], [ 8596, 4 ], [ 2459, 4 ], [ 281, 2 ] ], "subscribers_count" : 1, "updated_at" : "2015-11-12T15:38:16Z", "private" : false, "readme_name" : "README", "full_name" : "anotherjesse/taboo", "owner" : 27, "id" : 53, "size" : 781, "network_count" : 11, "languages" : { "JavaScript" : 82130, "Ruby" : 1735 }, "idx" : 1, "watchers_count" : 21, "readme" : "KiBCZWZvcmUgcmVsZWFzZSAwLjIwIFswLzhdCiAgLSBbIF0gbnVsbCB0aXRs\n......", "forks" : 11, "homepage" : "http://overstimulate.com/projects/taboo", "fork" : false, "description" : "The solution for tabitus of the browser ", "has_downloads" : true, "has_pages" : false, "default_branch" : "master", "subscribers" : [ 11713940 ], "has_issues" : true, "stargazers_count" : 21, "open_issues_count" : 1, "watchers" : 21, "name" : "taboo", "language" : "JavaScript", "stargazers" : [ 27, 281, 303, 604, 1423, 7853, 8596, 12951, 16061, 21679, 33067, 51633, 73144, 81024, 144334, 144384, 472094, 986639, 1194992, 778015, 5877145 ], "created_at" : "2008-01-15T08:13:02Z", "pushed_at" : "2010-01-21T07:08:09Z", "forks_count" : 11, "open_issues" : 1 }
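As a small illustration of how a Repos record might be consumed, the sketch below turns the `contributors` pairs into per-user contribution shares and decodes a `readme` value. The function names are hypothetical. Treating `readme` as base64 is inferred from the sample value (and from the `readme_encoding` field in Dataset2), and the readme string used here is only the leading fragment of the sample above.

```python
import base64


def contribution_shares(contributors):
    """Map each contributor id to its fraction of the repo's contributions.

    `contributors` is a list of [user_id, contribution_count] pairs.
    """
    total = sum(count for _, count in contributors)
    if total == 0:
        return {}
    return {user_id: count / total for user_id, count in contributors}


def decode_readme(readme):
    """Decode a base64 readme string.

    b64decode ignores the embedded newlines by default; undecodable bytes
    are replaced rather than raising.
    """
    return base64.b64decode(readme).decode("utf-8", errors="replace")


# Fragment of the sample record for repo 53 (anotherjesse/taboo).
repo = {
    "id": 53,
    "contributors": [[27, 269], [12951, 105], [303, 66], [16061, 13],
                     [8596, 4], [2459, 4], [281, 2]],
    "readme": "KiBCZWZvcmUgcmVsZWFzZSAwLjIwIFswLzhd",
}

shares = contribution_shares(repo["contributors"])
top_contributor = max(shares, key=shares.get)
readme_start = decode_readme(repo["readme"])
```

For this record the top contributor is user 27, who is also the repo's `owner`, and the decoded fragment reads "* Before release 0.20 [0/8]".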

Ground Truth

In StackOverflow, each user answers many programming questions; we collect the tags of the questions a user has answered, together with the number of answered questions per tag. A retrieved developer is considered relevant to a query if he or she has also answered corresponding questions in StackOverflow, and the number of answered questions is taken as the degree of relevance.

Property Example Description
id 1266639 user id in GitHub
tag [u'sql-server===11', u'sql-server-2005===11', u'sql===8', u'stored-procedures===7', u'struts===3', u'.net===3', u'.net-2.0===3', u'asp.net===3', u'jsp===3', u'jsp-tags===3', u'date===2', u'java===2', u'web-applications===2', u'time===1', u'tsql===1', u'dateadd===1', u'cursor===1'] tags and the corresponding number of questions the user answers in StackOverflow

Sample:1266639, [u'sql-server===11', u'sql-server-2005===11', u'sql===8', u'stored-procedures===7', u'struts===3', u'.net===3', u'.net-2.0===3', u'asp.net===3', u'jsp===3', u'jsp-tags===3', u'date===2', u'java===2', u'web-applications===2', u'time===1', u'tsql===1', u'dateadd===1', u'cursor===1']
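The `tag===count` entries can be turned into a relevance lookup along these lines. This is a sketch under assumptions: the function names are hypothetical, and matching a query to a tag by exact string equality is a simplification, since the source does not specify how queries are mapped to tags.

```python
def parse_tag_counts(entries):
    """Convert entries such as 'sql-server===11' into a {tag: count} dict."""
    counts = {}
    for entry in entries:
        # Split on the last '===' so the tag text itself is left untouched.
        tag, _, count = entry.rpartition("===")
        counts[tag] = int(count)
    return counts


def relevance(tag_counts, query):
    """Degree of relevance: number of answered questions tagged with `query`.

    Returns 0 when the user has answered no question with that tag, i.e.
    the user is treated as not relevant to the query.
    """
    return tag_counts.get(query, 0)


# First entries of the sample for GitHub user 1266639.
tags = ["sql-server===11", "sql-server-2005===11", "sql===8",
        "stored-procedures===7", "java===2"]
tag_counts = parse_tag_counts(tags)
```

With these counts, `relevance(tag_counts, "sql")` gives 8 and `relevance(tag_counts, "python")` gives 0.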

Dataset2: Human labeled GitHub data download

Since some popular GitHub users may not appear in StackOverflow, and to gain a clearer understanding of the retrieved results, we also evaluate our model on a set of popular users. We first extract projects with more than 400 stars, and then extract the users who contribute to those projects. Finally, we obtain 11,437 projects and 158,646 users.

Users

Property Example Description
idx 1
user_id 917903
login talbs
user_type User
site_admin False
name Brian
company Talbot,MIT/edX
location Boston, MA
email hi.talbs@gmail.com
bio "the bio of user"
public_repos 1
public_gists 2
followers_count 9
following_count 7
created_at 1260195294 time stamp
updated_at 1433531131 time stamp
Sample: 2 2,163763,talbs,,False,Brian Talbot,MIT/edX,"Boston, MA",hi.talbs@gmail.com,,9,7,3

Repos

Property Example Description
idx 1
repos_id 1
name grit
full_name mojombo/grit full name of repos
owner
created_at 1193668636 time stamp
updated_at 1433946354 time stamp
homepage http://grit.rubyforge.org
language Ruby main language of repos
size 7954
forks_count 448
stagazers_count 1857
watchers_count 1857
forks_count 448
open_issues_count 2
subscribers_count 60
description **Grit is no longer maintained. Check out libgit2/rugged.** Grit gives you object oriented read/write access to Git repositories via Ruby.
readme_name README.md
readme_path README.md
readme_type file
readme_size 6641 byte
readme_content dWdnZWQpLioqCgpHcml0IGdpdmVzIHlvdSBvYmplY3Qgb3JpZW50ZWQgcmVh base64-encoded readme content
readme_encoding base64
readme_description README description... README text
Sample: 1,1,grit,mojombo/grit,,1193668636,1433946354,http://grit.rubyforge.org/,Ruby,7954,448,1857,1857,448,2,60,**Grit is no longer maintained. Check out libgit2/rugged.** Grit gives you object oriented read/write access to Git repositories via Ruby.,README.md,README.md,file,6641,"R3JpdAo9PT09CgoqKkdyaXQgaXMgbm8gbG9uZ2VyIG1ha",base64,"README content..."

Relationship between users and repos

Property Example Description
idx 1
fullname mojombo/grit fullname of repos
contributors mojombo***177===schacon***67===rtomayko***64===technoweenie***49 contributors and corresponding contributions

Sample: 1,mojombo/grit,mojombo***177===schacon***67===rtomayko***64===technoweenie***49===defunkt***34===pjhyett***19===
rsanheim***12===js***10===tmm1***9===therealadam***4===chapados***4===halorgium***4===dkowis***4===davetron5000***3
===peff***3===Voker57***3===hans***2===cristibalan***2===koraktor***2===dysinger***2===vmg***2===bkeepers***2
===cho45***2===bobbywilson0***1===darwin***1===cehoffman***1===dustin***1===franckverrot***1===hiroshi***1===igorw***1
===shepmaster***1===josb***1===itspriddle***1===kamal***1===kevinsawicki***1===martint***1===binki***1===pda***1===
smtlaissezfaire***1===sbryant***1===sr***1===wvl***1===holman***1
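The `contributors` field packs login and contribution count into one string, with `***` between login and count and `===` between contributors. A minimal parsing sketch (the function name is hypothetical) could look like this:

```python
def parse_contributors(field):
    """Parse 'login***count===login***count...' into (login, count) tuples."""
    pairs = []
    for chunk in field.split("==="):
        chunk = chunk.strip()  # tolerate line breaks inside the field
        if not chunk:
            continue  # skip empty pieces, e.g. from a trailing '==='
        login, _, count = chunk.rpartition("***")
        pairs.append((login, int(count)))
    return pairs


# Example value from the table above (mojombo/grit).
field = "mojombo***177===schacon***67===rtomayko***64===technoweenie***49"
contributors = parse_contributors(field)
total = sum(count for _, count in contributors)
```

The resulting pairs keep the order of the original string, which in the sample is sorted by decreasing contribution count.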

Ground Truth

File name Description
angular_GT.csv ground truth of "angular"
app_GT.csv ground truth of "app"
framework_GT.csv ground truth of "framework"
game_GT.csv ground truth of "game"
image_GT.csv ground truth of "image"
javascript_GT.csv ground truth of "javascript"
linux_GT.csv ground truth of "linux"
plugin_GT.csv ground truth of "plugin"

Download

The GitHub dataset has been uploaded to my Baidu Cloud; you can download it via the following link and password.
link: http://pan.baidu.com/s/1bpI7ScR
retrieval password: dvjh

Acknowledgement

To appear