With the advent of social coding sites, software development has entered a new era of collaborative work. Social coding sites (e.g., GitHub) integrate social networking and distributed version control in a unified platform to facilitate collaborative development around the world. One unique characteristic of such sites is that the past development experience of developers recorded on them conveys implicit signals of each developer's programming capability and expertise, which can be applied in many areas, such as software developer recruitment for IT corporations. Motivated by this intuition, we aim to develop a framework to effectively locate developers with the right coding skills. To achieve this goal, we devise a generative probabilistic expert ranking model, into which consistency among projects is incorporated as graph regularization to enhance the expert ranking; we also interpret the model from the perspective of relevance propagation. For evaluation, StackOverflow is leveraged to supplement the ground truth of expertise. Finally, we implement and demonstrate a prototype system, SCSMiner, which provides an expert search service based on a real-world dataset crawled from GitHub.
As described above, our experiments are conducted on the GitHub dataset. However, the most significant difficulty is the lack of ground truth on whether a retrieved developer is an expert. One simple approach is to select some queries and judge the relevance of the retrieved results manually. However, this approach cannot be applied at large scale and may be affected by many human factors. We therefore use StackOverflow as an indirect ground truth for evaluation.
To intersect the data from GitHub and StackOverflow, we adopt a conservative approach that matches email addresses. In the GitHub dataset email addresses are present, while in the StackOverflow dataset email addresses are obscured but their MD5 hashes are available. Therefore, we merge a GitHub user and a StackOverflow user if their MD5 email hashes are identical. Furthermore, to simplify computation, we only consider users whose StackOverflow reputation is greater than 5 and whose number of GitHub followers is greater than 10. In the end, we obtain 16,567 users and 458,639 related projects.
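The matching step can be sketched roughly as follows. This is a minimal illustration, not the code used to build the dataset: the StackOverflow field names `email_hash` and `reputation` are hypothetical, and lowercasing and trimming the address before hashing is an assumption about how the hashes were produced.

```python
import hashlib


def email_md5(email: str) -> str:
    """Hex MD5 digest of an email address (trim + lowercase is an assumption)."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()


def match_users(github_users, so_users, min_followers=10, min_reputation=5):
    """Pair GitHub and StackOverflow users whose MD5 email hashes are identical.

    `email_hash` and `reputation` are hypothetical StackOverflow field names.
    """
    so_by_hash = {user["email_hash"]: user for user in so_users}
    matched = []
    for gh in github_users:
        email = gh.get("email")
        if not email:
            continue  # no address, nothing to hash
        so = so_by_hash.get(email_md5(email))
        if so is None:
            continue
        # The followers list may contain repeated ids, so count distinct ones.
        follower_count = len(set(gh.get("followers", [])))
        if follower_count > min_followers and so["reputation"] > min_reputation:
            matched.append((gh["id"], so["id"]))
    return matched
```

Indexing the StackOverflow users by hash first keeps the join linear in the number of users instead of comparing every pair.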
Property | Example | Description |
---|---|---|
idx | 1 | ordinal index of the item in our dataset |
_id | 3502408 | id in GitHub |
id | 3502408 | id in GitHub |
login | jermsull | login username |
followers | [40710,1466314,5891,40710,1466314,5891] | id set of followers |
following | [40710,184706,991018,1123350,1466314,40710] | id set of users the user follows |
repos | [2325298, 2319498, 2325298, 2319498] | id set of repos the user contributes to |
location | Hangzhou, China | |
foundation | Linux Foundation | |
organizations | [4604446, 4604446] | id set of organizations the user belongs to |
hireable | true | whether the user is hireable; can be used for recruitment |
name | Yao Wan | |
site_admin | false | |
created_at | 2011-09-03T15:26:22Z | |
updated_at | 2015-06-11T00:46:13Z | |
public_repos | 2 | public repos count |
public_gists | 0 | public gists count |
starred | [44658187,10202180,9393759,44658187,10202180,9393759] | id set of starred repos |
bio | 'My name is Yao Wan, a PhD candidate from Zhejiang University.' | |
email | wanyao@zju.edu.cn | |
blog | wanyao.me | |
company | | |
Sample: { "_id" : 3502408, "public_repos" : 2, "repos" : [ 40262919, 37088056, 40262919, 37088056 ], "site_admin" : false, "updated_at" : "2015-12-08T15:08:27Z", "hireable" : null, "id" : 3502408, "blog" : "http://torchlite.com", "followers" : [ 40710, 1466314, 5891, 40710, 1466314, 5891 ], "location" : "Indianapolis, Indiana", "type" : "User", "email" : "sullivan.jeremy@gmail.com", "bio" : null, "company" : "Torchlite Marketing", "login" : "jermsull", "organizations" : [ ], "public_gists" : 0, "name" : "Jeremy Sullivan", "idx" : 1, "created_at" : "2013-02-07T14:57:52Z", "following" : [ 40710, 184706, 991018, 1123350, 1466314, 40710, 184706, 991018, 1123350, 1466314 ], "starred" : [ 44658187, 10202180, 9393759, 44658187, 10202180, 9393759 ] }
Property | Example | Description |
---|---|---|
idx | 1 | |
id | 53 | |
_id | 53 | |
has_wiki | true | |
mirror_url | null | |
contributors | [[27,269],[12951,105],[303,66],[16061,13],[8596,4],[2459,4],[281,2]] | contributors and the corresponding contribution |
subscribers_count | 1 | |
private | false | |
full_name | anotherjesse/taboo | |
owner | 27 | |
size | 781 | |
network_count | 11 | |
languages | {"JavaScript" : 82130,"Ruby" : 1735} | languages the repo contains |
watchers_count | 21 | |
forks | 11 | |
homepage | http://overstimulate.com/projects/taboo | |
fork | false | |
has_downloads | true | |
has_pages | false | |
default_branch | master | |
language | JavaScript | main language of the repo |
subscribers | [11713940] | |
stargazers_count | 21 | |
open_issues_count | 1 | |
watchers | 21 | |
name | taboo | |
forks_count | 11 | |
stargazers | [27,281,303,604,1423,7853,8596,12951,16061,21679,33067] | |
open_issues | 1 | |
created_at | 2008-01-15T08:13:02Z | |
pushed_at | 2010-01-21T07:08:09Z | |
updated_at | 2015-11-12T15:38:16Z | |
description | The solution for tabitus of the browser | description of the repo |
readme_name | README | |
readme | KiBCZWZvcmUgcmVsZWFzZSAwLjIwIFswLzhdcwo=\n...... | Base64-encoded README content |
Sample: { "_id" : 53, "has_wiki" : true, "mirror_url" : null, "contributors" : [ [ 27, 269 ], [ 12951, 105 ], [ 303, 66 ], [ 16061, 13 ], [ 8596, 4 ], [ 2459, 4 ], [ 281, 2 ] ], "subscribers_count" : 1, "updated_at" : "2015-11-12T15:38:16Z", "private" : false, "readme_name" : "README", "full_name" : "anotherjesse/taboo", "owner" : 27, "id" : 53, "size" : 781, "network_count" : 11, "languages" : { "JavaScript" : 82130, "Ruby" : 1735 }, "idx" : 1, "watchers_count" : 21, "readme" : "KiBCZWZvcmUgcmVsZWFzZSAwLjIwIFswLzhdCiAgLSBbIF0gbnVsbCB0aXRs\n......", "forks" : 11, "homepage" : "http://overstimulate.com/projects/taboo", "fork" : false, "description" : "The solution for tabitus of the browser ", "has_downloads" : true, "has_pages" : false, "default_branch" : "master", "subscribers" : [ 11713940 ], "has_issues" : true, "stargazers_count" : 21, "open_issues_count" : 1, "watchers" : 21, "name" : "taboo", "language" : "JavaScript", "stargazers" : [ 27, 281, 303, 604, 1423, 7853, 8596, 12951, 16061, 21679, 33067, 51633, 73144, 81024, 144334, 144384, 472094, 986639, 1194992, 778015, 5877145 ], "created_at" : "2008-01-15T08:13:02Z", "pushed_at" : "2010-01-21T07:08:09Z", "forks_count" : 11, "open_issues" : 1 }
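The `readme` values shown above look like Base64 text rather than a hash (the popular-user dataset also lists `readme_encoding` as `base64`), so the README text can presumably be recovered by decoding. A minimal sketch, assuming the content is UTF-8; the function name `decode_readme` is illustrative only:

```python
import base64


def decode_readme(encoded: str) -> str:
    """Decode a Base64 README field into text.

    By default b64decode discards characters outside the Base64 alphabet,
    so line breaks embedded in the field do not need to be stripped first.
    """
    raw = base64.b64decode(encoded)
    # UTF-8 is an assumption; undecodable bytes are replaced, not raised.
    return raw.decode("utf-8", errors="replace")


# First 36 characters of the sample value, with a line break added.
print(decode_readme("KiBCZWZvcmUgcmVsZWFzZSAw\nLjIwIFswLzhd"))
# prints "* Before release 0.20 [0/8]"
```

Values that were truncated with `......` in this README cannot be decoded reliably; only the full field from the dataset should be passed in.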
On StackOverflow, each user answers many programming questions, and we collect the tags of the questions each user has answered together with the number of answered questions per tag. When evaluating retrieved results, we consider a retrieved developer relevant if they have also answered questions with the corresponding tag on StackOverflow, and we take the number of answered questions as the degree of relevance.
Property | Example | Description |
---|---|---|
id | 1266639 | user id in GitHub |
tag | [u'sql-server===11', u'sql-server-2005===11', u'sql===8', u'stored-procedures===7', u'struts===3', u'.net===3', u'.net-2.0===3', u'asp.net===3', u'jsp===3', u'jsp-tags===3', u'date===2', u'java===2', u'web-applications===2', u'time===1', u'tsql===1', u'dateadd===1', u'cursor===1'] | tags and the corresponding number of questions the user has answered on StackOverflow |
Sample:1266639, [u'sql-server===11', u'sql-server-2005===11', u'sql===8', u'stored-procedures===7', u'struts===3', u'.net===3', u'.net-2.0===3', u'asp.net===3', u'jsp===3', u'jsp-tags===3', u'date===2', u'java===2', u'web-applications===2', u'time===1', u'tsql===1', u'dateadd===1', u'cursor===1']
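A record in this format can be read with a few lines of Python. The sketch below assumes each line has the shape `id, [u'tag===count', ...]` as in the sample (without the `Sample:` label), and it treats an exact match between the query and a tag as the relevance lookup, which is a simplification rather than the documented evaluation procedure. The function names are illustrative.

```python
import ast


def parse_tag_line(line: str):
    """Split 'id, [u'tag===count', ...]' into (user_id, {tag: count})."""
    user_id, _, tag_list = line.partition(",")
    tags = {}
    # The u'...' prefix is still a valid string literal in Python 3.
    for item in ast.literal_eval(tag_list.strip()):
        tag, _, count = item.rpartition("===")
        tags[tag] = int(count)
    return int(user_id), tags


def relevance(tags, query: str) -> int:
    """Graded relevance: answered-question count for the tag equal to the query."""
    return tags.get(query.lower(), 0)


user_id, tags = parse_tag_line("1266639, [u'sql-server===11', u'sql===8', u'java===2']")
print(user_id, relevance(tags, "sql"))
```

Using `rpartition` splits on the last `===`, so a tag that happened to contain the separator would still parse with the count taken from the end.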
Since some popular GitHub users may not appear on StackOverflow, and to gain a clearer understanding of the retrieved results, we also evaluate our model on a set of popular users. We first extract projects with more than 400 stars, and then extract the users who contribute to those projects. In the end, we obtain 11,437 projects and 158,646 users.
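The selection described above might look like the following sketch. It assumes repo records shaped like the repo table earlier in this README, with a `stargazers_count` field and `contributors` stored as `[user_id, contribution_count]` pairs; the function name `popular_subset` is illustrative.

```python
def popular_subset(repos, min_stars=400):
    """Keep repos with more than `min_stars` stars and collect their contributors."""
    popular = [repo for repo in repos if repo["stargazers_count"] > min_stars]
    users = set()
    for repo in popular:
        # Each contributor entry is assumed to be a [user_id, count] pair.
        users.update(user_id for user_id, _count in repo["contributors"])
    return popular, users
```

The comparison is strict, so a project with exactly 400 stars is excluded, matching "greater than 400" in the text.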
Property | Example | Description |
---|---|---|
idx | 1 | |
user_id | 917903 | |
login | talbs | |
user_type | User | |
site_admin | False | |
name | Brian | |
company | Talbot,MIT/edX | |
location | Boston, MA | |
email | hi.talbs@gmail.com | |
bio | "the bio of user" | |
public_repos | 1 | |
public_gists | 2 | |
followers_count | 9 | |
following_count | 7 | |
created_at | 1260195294 | time stamp |
updated_at | 1433531131 | time stamp |
Property | Example | Description |
---|---|---|
idx | 1 | |
repos_id | 1 | |
name | grit | |
full_name | mojombo/grit | full name of repos |
owner | | |
created_at | 1193668636 | time stamp |
updated_at | 1433946354 | time stamp |
homepage | http://grit.rubyforge.org | |
language | Ruby | main language of repos |
size | 7954 | |
forks_count | 448 | |
stagazers_count | 1857 | |
watchers_count | 1857 | |
forks_count | 448 | |
open_issues_count | 2 | |
subscribers_count | 60 | |
description | **Grit is no longer maintained. Check out libgit2/rugged.** Grit gives you object oriented read/write access to Git repositories via Ruby. | |
readme_name | README.md | |
readme_path | README.md | |
readme_type | file | |
readme_size | 6641 | size in bytes |
readme_content | dWdnZWQpLioqCgpHcml0IGdpdmVzIHlvdSBvYmplY3Qgb3JpZW50ZWQgcmVh | Base64-encoded README content |
readme_encoding | base64 | |
readme_description | README description... | README text |
Property | Example | Description |
---|---|---|
idx | 1 | |
fullname | mojombo/grit | fullname of repos |
contributors | mojombo***177===schacon***67===rtomayko***64===technoweenie***49 | contributors and corresponding contributions |
Sample: 1,mojombo/grit,mojombo***177===schacon***67===rtomayko***64===technoweenie***49===defunkt***34===pjhyett***19===rsanheim***12===js***10===tmm1***9===therealadam***4===chapados***4===halorgium***4===dkowis***4===davetron5000***3===peff***3===Voker57***3===hans***2===cristibalan***2===koraktor***2===dysinger***2===vmg***2===bkeepers***2===cho45***2===bobbywilson0***1===darwin***1===cehoffman***1===dustin***1===franckverrot***1===hiroshi***1===igorw***1===shepmaster***1===josb***1===itspriddle***1===kamal***1===kevinsawicki***1===martint***1===binki***1===pda***1===smtlaissezfaire***1===sbryant***1===sr***1===wvl***1===holman***1
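The contributor field packs logins and contribution counts into one string, with `===` between contributors and `***` between a login and its count. A minimal parsing sketch, assuming each row is `idx,fullname,contributors` as in the sample (without the `Sample:` label); the function names are illustrative:

```python
def parse_contributors(field: str):
    """Turn 'login***count===login***count' into a list of (login, count)."""
    pairs = []
    for entry in field.split("==="):
        login, _, count = entry.rpartition("***")
        pairs.append((login, int(count)))
    return pairs


def parse_contributor_row(line: str):
    """Split one 'idx,fullname,contributors' row into typed values."""
    # Split at most twice so the contributor string is kept in one piece.
    idx, fullname, contributors = line.strip().split(",", 2)
    return int(idx), fullname, parse_contributors(contributors)


row = "1,mojombo/grit,mojombo***177===schacon***67===rtomayko***64"
print(parse_contributor_row(row))
```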
File name | Description |
---|---|
angular_GT.csv | ground truth of "angular" |
app_GT.csv | ground truth of "app" |
framework_GT.csv | ground truth of "framework" |
game_GT.csv | ground truth of "game" |
image_GT.csv | ground truth of "image" |
javascript_GT.csv | ground truth of "javascript" |
linux_GT.csv | ground truth of "linux" |
plugin_GT.csv | ground truth of "plugin" |
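The column layout of these ground-truth files is not described here, so the sketch below works on plain lists of relevance grades rather than reading the CSVs. It shows one common way to score a ranking against graded relevance, NDCG@k; this README does not state which metric was used, so the choice of NDCG is an assumption. The ideal ordering is computed from the retrieved list only, which is a simplification.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(ranked_relevances, k):
    """NDCG@k for relevance grades listed in ranked order."""
    ideal = sorted(ranked_relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0  # no relevant item retrieved at all
    return dcg(ranked_relevances[:k]) / ideal_dcg


# Grades could be answered-question counts for the query tag, in ranked order.
print(ndcg_at_k([11, 8, 3], k=3))
```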
The GitHub dataset has been uploaded to my Baidu Cloud; you can download it via the following link and password. Link: http://pan.baidu.com/s/1bpI7ScR Retrieval password: dvjh
To appear