
How to Customize Your Open Source Data Reports with the OpenDigger MCP Server

· 9 min read
Frank Zhao
Ph.D candidate at X-lab, author of OpenDigger

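MCP has recently become a hot topic in the open source ecosystem. OpenDigger has implemented and open-sourced the first version of its own MCP server and, through an analysis of the Kubernetes project, verified that it is feasible for a large language model to fetch open source metrics in real time and analyze them.

What is MCP?

MCP (Model Context Protocol) is an open protocol released in late 2024 by Anthropic, the company behind Claude. It provides a standardized interface for integrating large language models (LLMs) with external data sources and tools. An MCP server can expose several kinds of capabilities, such as static resources (Resource), tool calls (Tool), and LLM prompts (Prompt), so that MCP-enabled clients can access external data sources or invoke tools automatically, and the model can use these capabilities to support and improve its output during generation.

MCP has been steadily gaining popularity among developers. Many AI editors (such as Cursor and Windsurf), VSCode extensions (such as Cline), and chat clients (such as Cherry Studio and NextChat) have started to support it. The multi-language SDKs that Anthropic provides for MCP also let developers build their own MCP servers quickly, so in addition to the many official servers for mainstream platforms, a large number of MCP server projects have begun to emerge in the open source ecosystem.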

OpenDigger MCP Server

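OpenDigger aims to provide comprehensive and effective data metrics for open source projects. The metrics it produces have long been consumed by many downstream applications (such as HyperCRX, OpenLeaderboard, and OpenGalaxy), but none of these applications can analyze the data or generate insights on their own.

LLMs are very strong at text generation, which makes them a great aid for data insight. How to reference real data dynamically during generation and produce a useful data report, however, is a current research challenge and hot topic. MCP offers a new way to supply live online data to an LLM while it generates a report.

OpenDigger has open-sourced a first, basic version of its MCP server under X-lab's GitHub organization (X-lab2017/open-digger-mcp-server). The server provides the following two capabilities:

  • Metric fetching tool (Tool): fetches the metric files that OpenDigger produces for open source projects online and in real time, so the LLM can analyze them and use them in later generation steps.
  • Data report prompt (Prompt): explains the meaning of each metric to the LLM and helps developers quickly generate a data report that can be previewed directly in a web page.

Once the MCP server is installed, an LLM can call OpenDigger's metric data while generating an open source insight report, and use it for data visualization and analysis.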
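To make the metric fetching step concrete, here is a minimal Python sketch of what such a tool might do under the hood. It is not the server's actual implementation: the base URL and the `{platform}/{owner}/{repo}/{metric}.json` path layout are assumptions about OpenDigger's static data export and should be checked against the OpenDigger documentation, and `metric_url` and `fetch_metric` are hypothetical helper names.

```python
import json
from urllib.request import urlopen

# Assumed base URL of OpenDigger's static metric files (verify before use).
BASE_URL = "https://oss.open-digger.cn"


def metric_url(platform: str, owner: str, repo: str, metric: str) -> str:
    """Build the URL of one metric file for one repository."""
    return f"{BASE_URL}/{platform}/{owner}/{repo}/{metric}.json"


def fetch_metric(platform: str, owner: str, repo: str, metric: str) -> dict:
    """Download and decode one metric file. Requires network access."""
    with urlopen(metric_url(platform, owner, repo, metric), timeout=30) as resp:
        return json.load(resp)


# Building the URL needs no network; fetching is left to the caller.
url = metric_url("github", "kubernetes", "kubernetes", "openrank")
print(url)
```

An MCP tool would wrap a function like `fetch_metric` and return the decoded JSON to the model as the tool result.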

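Example data report

This article uses the Cline extension as an example to show how DeepSeek-V3, once it has access to online data, can generate an insight report for an open source project.

After installing the OpenDigger MCP Server locally, enable it and turn on the MCP Auto-approve option so that data can be fetched automatically. Then use the Prompt provided by the project to have DeepSeek-V3 generate a data report for the main Kubernetes repository.

As the figure above shows, after receiving the task the model first analyzed it and broke it down into the following steps:

  • Use the MCP server to fetch four metrics for the repository: OpenRank, Star, Participants, and Contributors
  • Choose the data granularity for the analysis (yearly, quarterly, or monthly) based on the age of the repository
  • Generate an HTML page that shows the visualizations and interprets the trends
  • Use Chart.js for the data visualization

The model then automatically called the MCP server's get_open_digger_metric tool to fetch the data files. Based on the repository's creation time, it chose yearly data as the analysis granularity. After analyzing the data, it created a file named kubernetes-report.html directly in the editor, wrote the yearly trends and their interpretation into it, and finally prompted the user to open the page in a browser from the command line.

The whole process ran in one pass: the user only had to state the requirement, and the model worked with the MCP server to fetch the data and generate the visual report step by step.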
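The granularity step the model performed can be sketched as follows. This is only an illustration: the thresholds in `pick_granularity` are hypothetical, the input is assumed to be a dictionary keyed by `YYYY-MM`, and summing monthly values per bucket is an assumption that suits counts such as stars better than every metric.

```python
import re
from collections import defaultdict

# Only keys shaped like "2015-01" are treated as monthly data points.
MONTH_KEY = re.compile(r"^\d{4}-\d{2}$")


def pick_granularity(first_month: str, last_month: str) -> str:
    """Choose a granularity from the number of calendar years covered.
    The thresholds are hypothetical, for illustration only."""
    span_years = int(last_month[:4]) - int(first_month[:4]) + 1
    if span_years >= 5:
        return "year"
    if span_years >= 2:
        return "quarter"
    return "month"


def bucket(month_key: str, granularity: str) -> str:
    """Map a "YYYY-MM" key to its year, quarter, or month bucket."""
    year, month = month_key.split("-")
    if granularity == "year":
        return year
    if granularity == "quarter":
        return f"{year}Q{(int(month) - 1) // 3 + 1}"
    return month_key


def aggregate(monthly: dict, granularity: str) -> dict:
    """Sum monthly values per bucket, ignoring non-monthly keys."""
    totals = defaultdict(float)
    for key, value in monthly.items():
        if MONTH_KEY.match(key):
            totals[bucket(key, granularity)] += value
    return dict(totals)


# Made-up sample values; the bare "2016" key is skipped by the filter.
sample = {"2015-01": 1.0, "2015-02": 2.0, "2016-04": 3.0, "2016": 99.0}
months = sorted(k for k in sample if MONTH_KEY.match(k))
granularity = pick_granularity(months[0], months[-1])
print(granularity, aggregate(sample, granularity))
```

For a repository as old as Kubernetes, this rule would select yearly buckets, matching the choice the model made in the session described above.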

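The figure below shows the visualization and interpretation of the OpenRank metric on the final page:

As the figure shows, DeepSeek-V3 first plotted the yearly OpenRank data of the main Kubernetes repository with Chart.js and then gave specific insights. Based on the trend, it divided the repository's development into the following stages:

  • 2015 to 2017: rapid growth. OpenRank rose quickly as the technology gained recognition and adoption as a container orchestration platform.
  • 2018 to 2019: stable maturity. OpenRank stayed relatively flat, with little change.
  • 2020 to 2022: slow decline. OpenRank began to fall gradually. The model also pointed out several possible factors behind this, such as the project stabilizing, developers becoming more active in extension projects across the ecosystem, and the standardization of container technology being complete.
  • 2023 to present: the recent trend is relatively stable, with a slight rebound in 2023. The monthly data also fluctuates, possibly because of releases or specific features.

DeepSeek-V3 correctly identified the interface and parameters provided by the MCP server, called the interface correctly to obtain the data, and then generated an HTML file that visualizes the data and provides insights. Impressively, although it used yearly data for the overall analysis, it also drew on monthly data to explain the most recent two years in detail.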

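Conclusion

MCP is currently the most advantageous interaction protocol in the LLM ecosystem and has already developed a thriving open source ecosystem, with many developers building around it both upstream and downstream. By implementing its own MCP server, OpenDigger has verified that LLMs (such as DeepSeek-V3) can be used for customized data analysis. Anyone interested is welcome to try it out and contribute.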

February 2025 Open Source Ecosystem Data Insight Report

· 7 min read
Frank Zhao
Ph.D candidate at X-lab, author of OpenDigger
Will Wang
Prof. @ ECNU / Founder of X-lab

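The OpenRank metric is an open source implementation of the evaluation metrics in the "Information Technology - Open Source Governance" series of standards of the Electronics Standards Institute of the Ministry of Industry and Information Technology. It effectively reflects the collaborative influence of open source projects among developers, helping us understand the open source world, discover open source trends, and gain insight into open source events.

Hot Event: DeepSeek's Open Source Week Ignites Global Interest in Low-Level LLM Optimization

After releasing the DeepSeek-R1 model, which made waves worldwide in January 2025, DeepSeek announced a week-long "Open Source Week" on February 21, 2025. Starting February 24, it open-sourced one core technology per day for five consecutive days, aiming to promote the sharing of AI technology and accelerate industry adoption. The five technologies were released across 7 open source repositories. According to OpenDigger data, on the back of Open Source Week, DeepSeek's GitHub organization gained 56.2k+ stars between February 24 and March 6, and 805 developers took part in discussions and collaboration. DeepSeek's enterprise OpenRank again grew strongly, by nearly 60%, reaching 330 and moving up to 11th place among Chinese companies.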

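The five technologies released during Open Source Week were:

  • Day 1 - FlashMLA
    • An efficient MLA (multi-head latent attention) decoding kernel for Hopper GPUs that optimizes compute allocation for variable-length sequences and significantly reduces inference cost.
  • Day 2 - DeepEP
    • The first EP (expert parallelism) communication library designed specifically for MoE (mixture-of-experts) models. It supports FP8 low-precision computation, improves inter-GPU communication efficiency by 10x, and balances high throughput with low latency.
  • Day 3 - DeepGEMM
    • A general matrix multiplication (GEMM) acceleration library based on FP8 precision, whose core kernel is only about 300 lines of code. It optimizes the matrix operations used in deep learning and improves training and inference efficiency.
  • Day 4 - Three releases on parallelism strategies
    • DualPipe: a bidirectional pipeline parallelism algorithm that optimizes the model training process.
    • EPLB: a load balancing algorithm for MoE that addresses uneven resource allocation across experts.
    • profile-data: published profiling data from the training framework, helping developers reproduce and optimize.
  • Day 5 - 3FS distributed file system
    • A high-performance distributed storage system for AI training that combines SSDs with RDMA networking to squeeze the most out of hardware bandwidth. It has been called "a new benchmark for data processing".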

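In terms of star growth, FlashMLA benefited from being first: it gained 7k+ stars on day one and 11.3k+ stars in total as of March 6. 3FS, the distributed file system released on the last day, drew particular attention from developers, gaining nearly 4k stars on its release day and more than 8k stars as of March 6.

Geographically, DeepSeek's star growth in February 2025 still spanned 127 countries and regions, and each country's share followed a distribution similar to January's. Compared with the previous month, contributions were more concentrated in China, which led with 70.69%. The United States and India formed the second tier with 8.08% and 3.38% respectively, followed by Canada, the United Kingdom, Singapore, and Brazil.

  • Author's Comments: DeepSeek's Open Source Week was not only a display of technical strength but also the "open source spirit" put fully into practice: opening up code to move the whole industry forward, and bearing out the strategic foresight that "the more open you are, the larger your ecosystem grows".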


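Recommended Projects of the Month

The DeepSeek wave has also begun to have a major impact on the ecosystem of foundational LLM technology, and several projects have seen explosive growth as a result.

kvcache-ai/ktransformers

  • KTransformers aims to provide various low-level optimizations for foundation models. It attracted little attention after being open-sourced in July 2024. In February 2025 it began supporting optimizations for the DeepSeek V3 and R1 models, including reduced GPU memory usage during inference and longer context lengths. That month, more than half a year after its creation, the project took off: its OpenRank grew 34-fold to 138, and 736 developers joined its discussions and collaboration, making it a breakout project.
  • Repository: https://github.com/kvcache-ai/ktransformers

huggingface/open-r1

  • The release of DeepSeek-R1 set off a global wave of reproduction efforts. Hugging Face, the world's largest model hosting platform, published Open-R1, a fully open source reproduction of DeepSeek-R1. The repository has gained 22.8k+ stars since it was open-sourced. In February 2025, 359 developers took part in its discussions and collaboration, and its OpenRank reached 88, earning it a place on the global repository growth leaderboard.
  • Repository: https://github.com/huggingface/open-r1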

January 2025 Open Source Monthly Insight Report

· 7 min read
Frank Zhao
Ph.D candidate at X-lab, author of OpenDigger
Will Wang
Prof. @ ECNU / Founder of X-lab

The OpenRank metrics are an open-source implementation of the evaluation criteria outlined in the "Information Technology - Open Source Governance" series of standards developed by the Electronics Standards Institute of the Ministry of Industry and Information Technology. These metrics effectively reflect the collaborative influence of open-source projects among developers, thereby aiding our understanding of the open-source ecosystem, identifying emerging trends, and uncovering significant events.

The Global Impact of DeepSeek: Pioneering a New Era of AI

On January 20, 2025, the Chinese AI company DeepSeek unveiled its R1 series of large language models, causing a seismic shift in the global AI industry. Characterized by their low cost, high performance, and open-source nature, these models not only triggered a significant impact on the U.S. financial markets in the short term but also profoundly influenced the technological trajectory, industry landscape, and geopolitical dynamics of large language model development. This insight report will delve into DeepSeek's entire suite of models, providing a comprehensive data analysis.

Overview

DeepSeek launched its R1 reasoning model on GitHub on January 20, 2025, followed by the release of the Janus Pro multimodal model on January 28. These models quickly gained global attention due to their exceptional cost-effectiveness and performance. From the release of the R1 model until February 6, DeepSeek's official GitHub organization garnered over 150,000 new stars, with 1,679 active developers contributing. Five of DeepSeek's repositories entered the top 300 list of Chinese OpenRank repositories in January 2025, with DeepSeek-R1 ranking 62nd after just 10 days of being open-sourced. In the OpenRank enterprise rankings, DeepSeek scored 207 points in January 2025, rapidly ascending to 86th position globally and 13th in China.

2025.1 OpenRank Leaderboard of Chinese Companies Top 15

| # | Company | OpenRank | Change (±) | Active Repos Count | Change (±) | Active Developers Count | Change (±) |
|---|---|---|---|---|---|---|---|
| 1 | Huawei | 10416.91 | 441.38 | 3005 | 93 | 4782 | 1103 |
| 2 | Alibaba | 1822.95 | 142.79 | 1410 | 306 | 2026 | 524 |
| 3 | Ant group | 1329.97 | 97.46 | 542 | 10 | 1671 | 336 |
| 4 | Baidu | 1119.37 | 83.37 | 192 | 19 | 978 | 249 |
| 5 | ByteDance | 684.21 | 0.5 | 371 | 2 | 1112 | 185 |
| 6 | ESPRESSIF | 529.56 | 23.4 | 168 | 15 | 868 | 69 |
| 7 | Tencent | 476.51 | 56.4 | 237 | 55 | 687 | 285 |
| 8 | DaoCloud | 424.47 | 89.53 | 49 | 6 | 555 | 186 |
| 9 | PingCAP | 423.89 | 14.15 | 76 | 11 | 252 | 36 |
| 10 | Fit2Cloud | 419.89 | 54.12 | 57 | 1 | 348 | 145 |
| 11 | Zilliz | 294.02 | 6.32 | 44 | 3 | 241 | 34 |
| 12 | StarRocks | 215.46 | 10.95 | 11 | - | 160 | 33 |
| 13 | DeepSeek | 207.45 | 172.47 | 16 | 1 | 1386 | 1207 |
| 14 | openKylin | 204.37 | 59.26 | 117 | 100 | 118 | 96 |
| 15 | Deepin | 162.04 | 9.12 | 122 | 10 | 83 | 3 |

Star Growth Analysis

The following chart illustrates the daily star growth for the five fastest-growing repositories under DeepSeek's GitHub account up to February 6. Notably, DeepSeek-R1's repository saw an immediate surge of over 2,000 stars on the day of its release, with daily increments ranging between 2,000 and 4,000 stars until January 26. The true explosion occurred on January 27 when the U.S. stock market experienced a sharp decline following the release of DeepSeek-R1. NVIDIA's stock plummeted by 17% on that day, leading to widespread recognition of DeepSeek-R1 and boosting the popularity of its base model V3 and the Janus Pro multimodal model released on January 28. On January 28, both V3 and R1 models saw star growth exceeding 10,000, while the Janus repository gained over 4,000 stars. Subsequently, the growth rate slowed down, with another spike observed on February 5 following the Lunar New Year holiday in China.

The distribution of star growth by country and region is depicted in the chart below. According to OpenDigger data, the 150,000 stars accumulated during this period originated from 185 countries and regions worldwide. On the day of the R1 release, stars came from 82 countries, with the United States contributing the most (28%), significantly surpassing China's share of 17.4%. Despite time zone differences, this highlights the rapid response and keen interest from U.S. developers. By January 28, the global impact peaked, with contributions from 149 countries. Brazil and South Korea were notable late entrants, while post-holiday activity on February 5 was predominantly driven by Chinese developers returning to work.

Participants Distribution

Although DeepSeek's models are primarily hosted on platforms like HuggingFace and ModelScope, GitHub has played a crucial role as a forum for developer discussions and Q&A sessions, far exceeding the volume of interactions on HuggingFace. Analyzing the global distribution of contributors based on OpenRank data reveals that China, the U.S., and India form the first tier of contributors. The second tier includes the UK, Brazil, and Germany, while Australia, Pakistan, and Singapore follow in the third tier. Notably, despite having fewer contributors, Singapore ranks highly in terms of contribution quality. Israel's growing tech sector is also reflected in this data.

Detailed analysis shows that DeepSeek has attracted numerous developers and enthusiasts who have been deeply involved in large language model research over the past six months. Prominent contributors include:

  • Krrish Dholakia (@krrishdholakia), founder and CEO of LiteLLM (OpenRank 193)
  • Yineng Zhang (@zhyncs), core maintainer of SGLang (OpenRank 180)
  • Michael (@mldangelo), core maintainer of Promptfoo (OpenRank 46)
  • yetone (@yetone), author of avante.nvim (OpenRank 57)
  • Dev Khant (@Dev-Khant), co-founder of Mem0 AI (OpenRank 31)
  • Junyan Qin (@RockChinQ), author of LangBot
  • wong2 (@wong2), author of ChatHub
  • Dongbo Wang (@daxian-dbw), a member of Microsoft's PowerShell team working on the AIShell project
  • Wenhua Cheng (@wenhuach21), from Intel's AutoRound team

This data indicates that while North American developers show strong interest in using DeepSeek, they are less actively engaged in discussions. Conversely, Chinese and Indian developers have been more proactive in participating and collaborating.

Key Findings

The release of DeepSeek's large language models marks a significant milestone in the global AI landscape. Within two weeks of the R1 launch, DeepSeek's multiple GitHub repositories received over 150,000 stars, with nearly 1,700 active developers, underscoring the global recognition and enthusiasm for this innovation. DeepSeek's OpenRank score also saw a dramatic increase.

Key observations from the data include:

  • Global Reach: DeepSeek-R1's influence extends across almost all major countries and regions, showcasing its broad appeal.
  • Rapid Response from U.S. Developers: U.S. developers exhibited a high level of sensitivity to technological advancements, responding faster than Chinese developers initially.
  • Contributor Diversity: Contributions come from students, individual AI enthusiasts, and corporate AI project leaders or founders of AI startups, forming a balanced community.
  • Indian Engagement: Indian developers play a crucial role in this AI wave, actively collaborating with Chinese counterparts.
  • North American Observation: While North American developers show significant interest, many remain observers rather than active contributors, with more engagement from student and Chinese-American communities.

Conclusion

Historically, China has often been seen as a consumer in the open-source community, occasionally criticized for limited contributions. However, projects like DeepSeek are now leading the way, demonstrating not only technical breakthroughs but also fostering extensive global participation and contributions. We hope to see more European and North American developers deeply engage in the development of top-tier Chinese projects.

In summary, DeepSeek's success is not just a technological triumph but also a social and industrial milestone. It has attracted global developers to contribute to the advancement of AI, setting the stage for future innovations in artificial intelligence. We look forward to DeepSeek continuing to lead the global AI revolution, opening up new possibilities for humanity.

December 2024 Open Source Ecosystem Data Insight Report

· 4 min read
Frank Zhao
Ph.D candidate at X-lab, author of OpenDigger
Will Wang
Prof. @ ECNU / Founder of X-lab

The OpenRank indicator is an open source implementation of the evaluation indicators in the "Information Technology Open Source Governance" series of standards of the Electronic Standards Institute of the Ministry of Industry and Information Technology. It can effectively reflect the collaborative influence of open source projects among developers, thereby helping us understand the open source world, discover open source trends, and gain insight into open source events.

Hot Event 1: Ghostty Released, Its Creator Still Young at Heart

  • Data Facts: According to OpenDigger data, within 5 days of its release, the Ghostty project attracted over 530 developers, more than 1,000 discussions, and gained over 16,000 stars. Its OpenRank surged past 100, settling at 105.

  • Detailed Analysis: Ghostty is a terminal emulator that runs on macOS and Linux. By using local GPU resources, it enhances terminal functionality and provides a smoother user experience. On December 26, 2024, after more than 2 years of development in a private repository, Ghostty was open-sourced and version 1.0 was officially released. The author, Mitchell Hashimoto, founded HashiCorp at the age of 23. He stepped down as CEO in 2016 to become CTO, resigned from the CTO position in 2021 to return to hands-on programming, and left the company he founded at the end of 2023. Data shows that the Ghostty project was created in March 2022 and contains over a million lines of code. Mitchell developed the project alone for two years until mid-2024, when other developers joined. He remains the primary developer, contributing over 90% of the project's code.

  • Author's Comments: As the founder of HashiCorp, Mitchell loves coding and is the founding engineer and core developer of well-known open-source projects like Vagrant, Consul, Terraform, and Vault. Despite being a multi-millionaire, his passion for coding remains unchanged, which is likely a significant factor in the project's popularity among developers.


Hot Event 2: Generative AI Empowers Embodied Intelligence, Genesis Officially Released

  • Data Facts: According to OpenDigger data, since its release on December 19, 2024, the Genesis project attracted over 500 developers within 10 days, with 21 contributors and nearly 20,000 stars. Its OpenRank settled at 85.

  • Detailed Analysis: Genesis is a research platform for embodied intelligence that integrates generative model capabilities. It consists of a general-purpose physics engine, a robot simulation platform, a photorealistic rendering system, and a data generation engine powered by generative AI. The data generation engine converts natural language into training data for the other modules. The project is developed by a team led by Dr. Chuang Gan, Chief Scientist at the MIT-IBM Watson AI Lab. In late 2023, the team published a paper introducing RoboGen, a framework that uses generative AI to provide unlimited learning data for robots and automate training. After over a year of development, RoboGen was open-sourced as the embodied intelligence research platform Genesis, gaining widespread attention.

  • Author's Comments: Embodied intelligence is a cutting-edge research area in artificial intelligence, with few open-source research platforms available. Facebook's Habitat platform, open-sourced in 2019, is a notable example. With the rise of generative AI, scientists are exploring its application in embodied intelligence to accelerate the development of intelligent robots. Dr. Gan's team, building on a solid theoretical foundation, has integrated generative AI technology into their research platform, which is expected to make significant contributions in this field.


Recommended Projects of the Month

eliza

  • eliza is a lightweight AI agent framework for individual developers, enabling quick creation of personal AI agents and workflows. Since its open-source release in July 2024, the project has focused on development and gained significant popularity in December 2024, with over 10,000 stars and 441 active developers in December. Its OpenRank has reached 149.
  • Repository: https://github.com/elizaOS/eliza

blink.cmp

  • blink.cmp is a code completion plugin for the Neovim editor. Unlike the popular Copilot, it is a traditional completion tool based on text indexing and fuzzy search, known for its efficiency. It can respond in milliseconds with an index size of 20,000, which has made it popular among Neovim users. The project was open-sourced in October 2024 and had 294 active developers in December, with an OpenRank of 108.
  • Repository: https://github.com/Saghen/blink.cmp

Thoughts and Plans on OpenDigger's Labeling Work

· 17 min read
Frank Zhao
Ph.D candidate at X-lab, author of OpenDigger

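Recently I made a fairly large update to OpenDigger's labels. It mainly added a batch of project and company labels and computed developer shares by country and region, primarily for the leaderboards published with BenchCouncil (the global OpenRank leaderboards for administrative divisions, companies, and projects). The work prompted some new thinking, which I share here in the hope of starting a discussion on how to plan and improve OpenDigger's label system going forward.

Overall, OpenDigger's labeling work has two parts: building the label system and building the label tooling. The label system is about how to construct an effective, maintainable label structure. The label tooling is about which technical approach to use to implement and maintain that system.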

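Building the label system

OpenDigger's label system grew out of OpenDigger's own data needs. Various data reports required different ways of aggregating metrics, especially by company and by country, so the initial design mainly labeled company and country data. Labels for foundations, technical domains, and project groups were added later. As the volume of label data grew, however, maintenance became harder and the lack of top-level design became more apparent. As of December 2024 there were more than a thousand labels, covering over 200 companies, dozens of foundations, and more than 500 projects. OpenDigger therefore needs a standardized label system that is easy to maintain and extend.

In short, early labels were added on demand, without a unified top-level design, so the structure tended to be flat: each type of label lived in its own folder, and labels cross-referenced each other by ID. While enriching the labels recently, I found that the main labeling needs are in fact related to one another, and this in turn led to some design changes. For example:

  • A project is generally initiated by some entity and may later be donated to a foundation, so projects do not need a directory of their own. They can be maintained under the directory of the entity that initiated them.
  • The initiating entity can be an individual, a company, a university, a government agency (such as the US Department of Veterans Affairs or the UK Ministry of Justice), a research institution (such as CERN), and so on. The types vary, but most are tied to the institutional structure of each country, so while the overall structure is similar, there are subtle differences between countries.
  • These entities need a standardized, workable classification. The classification matters not only for maintainability but also as the basis for all later aggregate queries, because the metric query tools built on top of the label system will use it to run their queries.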

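Based on these reflections, the construction of the label system can be discussed from the following angles:

Label structure

Structurally, labels used to be flat: countries, companies, foundations, universities, institutions, and projects all sat in sibling directories and cross-referenced one another. A country, for example, would list universities, companies, and foundations as its child labels. All of these labels, however, are viewed from the perspective of who initiated a project, so they can be built under a single directory, forming a three-level structure of "administrative division", "initiating entity", and "open source project".

  • The administrative division level mainly carries regional information such as the country, and can be refined further to the province or city level.
  • An initiating entity is an organization legally registered within that administrative division. These entities can themselves be classified further; the classification is discussed below.
  • A project is an open source project made up of organizations or groups of repositories on GitHub or Gitee. One open source project can contain multiple organizations or repositories and can be hosted on multiple platforms.

The initiator perspective should be the foundation of the whole label system. On top of it, other parallel label dimensions can be added, such as project type and technical domain. These labels are all built on project labels: they may only reference project-level labels as their child labels and may not use a platform's repositories or organizations directly as their label data. In other words, when a new project in some domain needs to be labeled, first identify its initiator and the administrative division the initiator belongs to, set up that data, and then reference the project label, rather than using repository or organization data directly.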

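Administrative divisions

The administrative division records the country or region the initiator belongs to, and existing standards can be adopted directly here. OpenDigger currently uses ISO 3166 for country labels. Country and region codes follow ISO 3166-1 alpha-2, under which every country or region is identified by a two-letter code and which also includes the full name of each country. The global developer distribution published by GitHub happens to follow the same standard (the difference being that it treats the European Union as a top-level division), so the two are easy to link. The companion standard ISO 3166-2 defines the first-level administrative subdivisions within each country or region, so countries and their first-level subdivisions can be defined entirely with the ISO 3166 family of standards.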

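Initiating entities

This part requires fairly specialized knowledge, and my own understanding may be off. Corrections are welcome.

As noted above, initiating entities depend on how each country defines legal entities, so this is the most complicated part. Universities, government agencies, and research institutions are relatively clear and simple; companies and foundations are the most complex.

Take the differences between China and the United States as an example. For most companies the structure is similar. Private businesses in particular mainly take the form of sole proprietorships, partnerships, limited liability companies, and joint-stock companies. OpenDigger's label system does not need to distinguish among them; a single company label is enough. The main difficulty lies in classifying foundations.

In China's classification of entities, ordinary enterprises are administered by the industry and commerce (market regulation) authorities, whereas social organizations, private non-enterprise units, and foundations are administered by the Ministry of Civil Affairs. This is why the only two open source foundations in China (the OpenAtom Foundation and the Chongqing Tiangong Kaiwu Open Source Foundation) are both registered with the Ministry of Civil Affairs, and why their Unified Social Credit Codes begin with 53, which denotes a foundation under the civil affairs system. A foundation is thus a distinct type of legal entity in China, and the only legally recognized types of non-profit organization in China are social organizations, private non-enterprise units, and foundations.

The US legal system, by contrast, has no legal entity type called a "foundation". All non-profit organizations in the US are corporations in nature and differ only in classification, mostly under section 501(c). Common types include charitable organizations under 501(c)(3), such as the Apache Software Foundation, and business leagues under 501(c)(6), such as the Linux Foundation. They differ in financial rules and oversight, which is one of the fundamental reasons the Linux Foundation has been able to expand rapidly through corporate donations in recent years, while the Apache Software Foundation has remained more laid-back.

Because of this difference, the word "foundation" means very different things in China and the US. In China it is a well-defined type of legal entity; in the US it is merely one name a non-profit may choose to register under. The Connectivity Standards Alliance, for example, is a 501(c)(6) organization just like the Linux Foundation, yet it is called an "alliance". This freedom in naming makes overseas foundations hard to track: for some organizations that call themselves foundations, we cannot even verify online what type of organization they are or whether they are really non-profit.

Another interesting difference is that in the US there is a corporate form between ordinary companies and non-profits called the PBC (Public Benefit Corporation). The company behind the recently popular social platform Bluesky is one. A PBC is a for-profit organization with a public-benefit purpose, corresponding to the "social enterprise" in the Chinese context. In China, however, "social enterprise" is not yet a legally recognized entity type. Instead, the China Charity Fair periodically conducts public assessments and grants informal social enterprise certification to companies and non-profits. In OpenDigger's label system, these are all classified as companies.

To sum up, at the initiating-entity level, apart from the clearly defined universities (University), government agencies (Agency), and research institutions (Institution), everything else is divided into companies (Company) and non-profit organizations (NPO). Under every country's legal system, foundations fall under non-profit organizations, and in rankings they are compared together with other non-profits such as industry alliances.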

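Community projects

Although the new design aims to find a legal-entity initiator for every project, in reality some projects have no clear initiator, some initiators want the project to be entirely community-driven, and some projects are initiated by individuals. Such projects are hard to map to a specific legal entity, so a community project type is needed to cover them.

Note that "community" here is only a category for projects without a clear initiator; Community itself is not part of OpenDigger's label system. I found that the definition of a community is very broad and vague: a company project can call itself a community, and so can an interest group. The label would therefore be open to abuse, and a ranking based on it would mean little. There may well be groups that genuinely need an independent identity, though, and this part of the design may be refined as needs change.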

Summary

Under the latest design, the overall label structure would look like this:

label_data
├── division                              # administrative divisions
│   ├── cn                                # China
│   │   ├── gd                            # Guangdong
│   │   │   └── huawei                    # Huawei
│   │   │       └── openharmony
│   │   └── zj                            # Zhejiang
│   │       └── alipay                    # Ant Group
│   │           └── tugraph
│   └── us                                # United States
│       ├── ca                            # California
│       │   └── linux_foundation          # Linux Foundation
│       │       └── valkey
│       └── md                            # Maryland
│           └── apache_software_foundation # Apache Software Foundation
└── technology                            # technical domains
    ├── cloud_native                      # cloud native
    │   ├── platform                      # platforms
    │   └── runtime                       # runtimes
    └── database                          # databases
        ├── graph                         # graph databases -> references :division/cn/zj/alipay/tugraph
        └── kv                            # key-value databases -> references :division/us/ca/linux_foundation/valkey
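The path convention in the tree above can be sketched in a few lines of Python. This is an illustration only, assuming references always take the form `:division/<country>/<region>/<entity>/<project>` as in the two examples shown; `ProjectRef`, `parse_project_ref`, and the `technology` mapping are hypothetical names, not part of OpenDigger's label tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectRef:
    country: str  # ISO 3166-1 alpha-2 code, lower-cased as in the tree
    region: str   # first-level subdivision within the country
    entity: str   # initiating entity
    project: str  # open source project


def parse_project_ref(ref: str) -> ProjectRef:
    """Parse a reference such as ':division/cn/zj/alipay/tugraph'.
    Raises ValueError when the reference is not a project-level path."""
    if not ref.startswith(":"):
        raise ValueError(f"not a label reference: {ref!r}")
    parts = ref[1:].split("/")
    if len(parts) != 5 or parts[0] != "division" or not all(parts):
        raise ValueError(f"not a project-level reference: {ref!r}")
    return ProjectRef(*parts[1:])


# Technology labels may only reference project-level labels.
technology = {
    "database/graph": [":division/cn/zj/alipay/tugraph"],
    "database/kv": [":division/us/ca/linux_foundation/valkey"],
}

# Because every reference carries its full path, the country behind each
# technology label can be read off without any further lookup.
countries_by_domain = {
    domain: sorted({parse_project_ref(ref).country for ref in refs})
    for domain, refs in technology.items()
}
print(countries_by_domain)
```

Rejecting anything shorter than a project-level path is one way to enforce the rule that parallel labels reference projects rather than raw repositories or organizations.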

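Building the label tooling

Label tooling is the more technical part: it is the concrete implementation of the label system above. The implementation has to support all the capabilities and business needs of the label system and, at the lower level, fit the structures used to interact with the database as well as the common operations on label data, such as set intersection, union, and difference.

The current label tool is written in TypeScript. At runtime it builds the whole label data set in memory from the label data files and supports basic operations and label relationship queries. In the long run, for scalability and query efficiency, the label data should be stored directly in the database, so that a metric query only requires a join.

However, tracing parent-child relationships across multiple label levels (for example, finding which country a project was initiated in) requires support for recursive CTEs in the database. The ClickHouse version that OpenDigger currently runs on does not support this feature, so the change has to wait until ClickHouse is upgraded.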
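As a rough illustration of the in-memory approach, labels can be modeled as sets of project names so that the intersection, union, and difference operations map directly onto set operators. The sketch below uses Python rather than the tool's TypeScript, and the label IDs and contents are made up for the example.

```python
# Hypothetical label data: label ID -> set of project names.
LABELS: dict[str, frozenset[str]] = {
    "division/cn": frozenset({"tugraph", "openharmony"}),
    "division/us": frozenset({"valkey"}),
    "technology/database": frozenset({"tugraph", "valkey"}),
}


def projects(label_id: str) -> frozenset[str]:
    """Return the projects carried by a label (KeyError if unknown)."""
    return LABELS[label_id]


# Intersection: database projects initiated in China.
cn_databases = projects("division/cn") & projects("technology/database")
# Union: every project that belongs to some administrative division.
all_divisions = projects("division/cn") | projects("division/us")
# Difference: Chinese projects that are not databases.
cn_non_database = projects("division/cn") - projects("technology/database")

print(sorted(cn_databases), sorted(all_divisions), sorted(cn_non_database))
```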
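To show what such an ancestor lookup involves, here is a small sketch that runs a recursive CTE through Python's built-in `sqlite3` module. SQLite stands in for ClickHouse purely for illustration, and the `label` table, its columns, and the `country_of` helper are hypothetical, not OpenDigger's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical adjacency-list table: each label points to its parent.
    CREATE TABLE label (id TEXT PRIMARY KEY, parent_id TEXT, kind TEXT);
    INSERT INTO label VALUES
        ('cn', NULL, 'country'),
        ('cn/zj', 'cn', 'region'),
        ('cn/zj/alipay', 'cn/zj', 'entity'),
        ('cn/zj/alipay/tugraph', 'cn/zj/alipay', 'project');
    """
)


def country_of(label_id: str):
    """Walk up the parent chain and return the country label ID, or None."""
    row = conn.execute(
        """
        WITH RECURSIVE ancestors(id, parent_id, kind) AS (
            SELECT id, parent_id, kind FROM label WHERE id = ?
            UNION ALL
            SELECT l.id, l.parent_id, l.kind
            FROM label AS l
            JOIN ancestors AS a ON l.id = a.parent_id
        )
        SELECT id FROM ancestors WHERE kind = 'country'
        """,
        (label_id,),
    ).fetchone()
    return row[0] if row else None


print(country_of("cn/zj/alipay/tugraph"))
```

The recursive step joins each row to its parent until the chain ends, which is the traversal that a plain join cannot express when the depth is not fixed.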

November 2024 Open Source Ecosystem Data Insight Report

· 5 min read
Frank Zhao
Ph.D candidate at X-lab, author of OpenDigger
Will Wang
Prof. @ ECNU / Founder of X-lab

The OpenRank indicator is an open source implementation of the evaluation indicators in the "Information Technology Open Source Governance" series of standards of the Electronic Standards Institute of the Ministry of Industry and Information Technology. It can effectively reflect the collaborative influence of open source projects among developers, thereby helping us understand the open source world, discover open source trends, and gain insight into open source events.

Hot Event 1: Bluesky's Surge, Driven by US Elections and AI Wave

  • Data Facts: According to OpenDigger data, multiple Bluesky repositories on GitHub experienced a surge in activity. These include the decentralized social media protocol repository, atproto, and the client repository, social-app. The total number of active developers across all repositories in November increased by 173% year-over-year to 1,082, with a total star increase of 5,800. The total OpenRank value increased by 67%, reaching 340 points.

  • Detailed Analysis: Bluesky is an independent project created by former Twitter CEO Jack Dorsey. It is a decentralized social media platform built on the newly developed AT Protocol. Following the US elections on November 5, some users dissatisfied with the election results chose to leave Twitter in search of new social platforms, and Bluesky became a significant option. A week after the election, its client app topped the free app charts in the US Apple App Store. Additionally, on November 16, Twitter updated its Privacy Policy to allow third-party platforms to use user data for generative AI training. In response, Bluesky officially stated that it would not use user data for generative AI training, leading many high-quality content creators to migrate to Bluesky to protect their digital content. The platform had approximately 10 million registered users as of September 2024, and the events of November pushed the number of registered users past 20 million by November 20.

  • Author's Comments: The tech world is not isolated from real-world events, which can significantly impact the open-source community. The rise of generative AI has also highlighted underlying issues, with developers and users taking action to protect their interests.


Hot Event 2: Redis Attempts to Control Peripheral Projects, Valkey Community Continues to Grow

  • Data Facts: According to OpenDigger data, the number of active developers in redis-rs, the Rust client repository for Redis, increased by 54% in November 2024 to 40, with many participating in discussions about Redis's request for the project's author to transfer control. Meanwhile, the Valkey community, which forked from Redis in March 2024, continues to grow, surpassing the main Redis repository in various metrics.

  • Detailed Analysis: On November 25, 2024, Armin Ronacher, the author of the Rust client project redis-rs, opened an issue discussing Redis's request for control over the project. The maintainer of the PHP client Predis reported receiving a similar request. This is not Redis's first attempt to control community projects: between 2020 and 2024, Redis transferred several community clients, including Jedis, redis-py, and Lettuce, to its GitHub organization. There are also concerns that new versions of community clients controlled by Redis may not be compatible with Valkey. Valkey is a community fork of Redis created in March 2024 after Redis announced changes to its project license. It is led by core developers from AWS, Alibaba Cloud, Google, and Tencent Cloud and is now hosted by the Linux Foundation. Since the Redis community split, Valkey has developed steadily, while Redis has become less active. According to OpenDigger data, the OpenRank of Valkey's main repository reached 71 points in November, while that of Redis's main repository dropped from 62 points in March to 27 points.

  • Author's Comments: Software ownership involves more than just code; it affects a project's sustainability and community trust. When an open-source project's ownership is transferred to a commercial company, community members often worry about the project's neutrality and openness. The future development of Redis and Valkey remains to be seen.


Recommended Projects of the Month

Julia

  • Julia is a high-performance dynamic programming language for numerical analysis and computational science. Development began in 2009, and version 1.0 was released in 2018. As the language core has matured, the development focus has shifted towards the standard libraries. In November 2024, the community moved the linear algebra standard libraries to a separate repository and transferred thousands of related issues with them. The migration appeared in the event logs as newly opened issues, which is why it stood out in the data. Julia's development remains stable, with an OpenRank value of 242 across all repositories as of November 2024.
  • Repository: https://github.com/JuliaLang/julia

Zen Browser

  • Zen Browser is an open-source browser based on the Firefox engine, open-sourced in April 2024 and gaining popularity in August. In November, the repository had 882 participating developers. Known for its excellent user experience, the browser features a split-screen display, a popular feature not natively supported in Chrome. According to OpenDigger data, the repository's OpenRank value reached 262 in November, ranking 63rd globally.
  • Repository: https://github.com/zen-browser/desktop

October 2024 Open Source Ecosystem Data Insight Report

· 5 min read
Frank Zhao
Ph.D candidate at X-lab, author of OpenDigger
Will Wang
Prof. @ ECNU / Founder of X-lab

The OpenRank indicator is an open source implementation of the evaluation indicators in the "Information Technology Open Source Governance" series of standards of the Electronic Standards Institute of the Ministry of Industry and Information Technology. It can effectively reflect the collaborative influence of open source projects among developers, thereby helping us understand the open source world, discover open source trends, and gain insight into open source events.

Hot Event 1: Linux Removes Russian Maintainers, Huawei Releases Native HarmonyOS

  • Data Facts: According to OpenDigger data, OpenHarmony has rapidly grown since its open-source release in August 2019, becoming the top-ranked open-source community in China. Currently, OpenHarmony projects are primarily hosted on the Gitee platform, with over 2,000 repositories, more than 8,000 contributors, and over 15,000 active developers. More than 70 tech companies, including HopeRun Software, iSoftStone, Shenzhen Kaihong, and Unionman Technology, are involved in its development.

  • Detailed Analysis: In late October 2024, Linux removed over a dozen Russian developers from its kernel maintainer list due to "compliance requirements." Linus Torvalds responded firmly to other developers' questions in a subsequent mailing-list thread. The event drew significant attention in the open-source community and highlighted the increasing impact of geopolitics on open-source technology. In May 2019, Huawei was added to the US Entity List, preventing it from using Google's Android OS. In response, Huawei released HarmonyOS in August 2019 and open-sourced its core code as the OpenHarmony project, donating it to the OpenAtom Open Source Foundation in May 2020. After over five years of development, OpenHarmony has become the highest-ranked open-source project group in China. In late October 2024, Huawei released a native HarmonyOS that is fully independent and built on OpenHarmony, marking the project's maturity.

  • Author's Comments: While technology itself is borderless, technologists have nationalities. In the face of significant geopolitical changes, we must maintain an open and cooperative stance while being prepared to lead and develop our core technologies. Only then can we leverage technology to drive national development and ensure strong global competitiveness.


Hot Event 2: Open Source Summer Programs Conclude, Global Summer Activities Thrive

  • Data Facts: According to OpenDigger data, due to the impact of the National Day holiday in China, most projects experienced a decline in OpenRank during October. However, due to the popularity of OSPP and GSoC, related projects saw an overall increase of 3.5%, with thousands of participants involved in summer activities.

  • Detailed Analysis: Both OSPP (Open Source Promotion Plan) and GSoC (Google Summer of Code) concluded in October. According to official data, both programs set new records in 2024, with 561 and 1,133 projects, respectively. OpenDigger data shows that similar summer programs targeting college students are emerging globally. For instance, the GSSoC24 (GirlScript Summer of Code) program launched in India in October, with over 2,000 students registering for certificates. Additionally, Woowa Brothers in South Korea initiated programming training courses targeting students, with over 4,500 learning PRs and 28,000 PR review comments across 10 learning repositories in October, placing multiple repositories on the global OpenLeaderboard.

  • Author's Comments: In recent years, open-source summer programs for college students have become more numerous and diverse. These programs not only produce excellent software but also provide students with valuable coding and practical experience, becoming important platforms for their technical growth and innovation.

  • Further Reading:

Recommended Projects of the Month

freeCodeCamp

  • freeCodeCamp is a popular online learning platform that teaches programming and web development skills through interactive methods. It offers free resources, including thousands of coding challenges, projects, algorithms, and front-end development practices. Its main repository has over 400,000 stars, consistently ranking first on GitHub's star chart. In October 2024, freeCodeCamp participated in Hacktoberfest, attracting more developers. During the month, 380 developers contributed, resulting in 435 PRs and over 2,200 discussions, boosting the project's OpenRank by 50% to 151.
  • Repository: https://github.com/freeCodeCamp/freeCodeCamp
  • Comment: Both freeCodeCamp and Hacktoberfest started in 2014, and their combination continues to inspire creativity after a decade of development.

Bolt.new

  • In early October 2024, StackBlitz, the company behind the WebContainer project, launched Bolt.new. This new product integrates AI assistants based on large language models with WebContainer technology, enabling local code generation and Node.js execution in the browser. This allows Node.js based software projects to be developed, debugged, and deployed entirely within the browser. The launch was well-received, with over 600,000 views on Twitter. Within a month, the repository received over 6,600 stars, and more than 1,100 developers participated in discussions and collaborations, resulting in an OpenRank of 163.
  • Repository: https://github.com/stackblitz/bolt.new
  • Comment: The emergence of large language models has significantly enhanced programming productivity, while WebContainer technology has revolutionized application deployment. Their combination provides unprecedented convenience and experience for developers, greatly inspiring their enthusiasm and creativity.

Hackpad

  • Hackpad is an interesting hackathon project initiated by Hack Club, a global community of high school hackers. The project invites developers to submit mini keyboard designs, including PCB designs, hardware models, and software programs, during the event. The organizers will produce physical keyboards based on the accepted designs and distribute them to participants. In October 2024, 178 participants submitted 287 PRs, contributing to the repository's OpenRank of 100.
  • Repository: https://github.com/hackclub/hackpad
  • Comment: Open-source collaboration platforms provide fertile ground for global community development. Hack Club, a global tech community of young students, stands out with creative ideas and activities, reminding us of the original hacker spirit: just for fun!

OSPP 2023 In-Depth Insight Report

· 15 min read
Frank Zhao
Ph.D candidate at X-lab, author of OpenDigger
Will Wang
Prof. @ ECNU / Founder of X-lab

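Background

OSPP (Open Source Promotion Plan, 开源之夏) is a summer program in the "Open Source Software Supply Chain Lighting Plan" series launched by the Institute of Software, Chinese Academy of Sciences. It encourages university students to take an active part in developing and maintaining open-source software and promotes the growth of outstanding open-source communities. Five editions have been held so far (2020 to 2024), and X-lab has been deeply involved since the first.

As an open-source data research project that has long been closely involved in OSPP, OpenDigger offers here an in-depth analysis of the OSPP 2023 data, as a way of giving back to the OSPP community.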

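OSPP 2023 Overall Figures

According to the OSPP community's data report, OSPP published 593 projects in 2023. Of these, 504 had a student selected and 421 were ultimately completed, a completion rate of 71% of the published projects.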

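OSPP 2023 Overall Figures

| Metric | Value | Change (increase/decrease) |
|---|---|---|
| Total projects | 593 | 91 |
| Projects with a selected student | 504 | 56 |
| Completed projects | 421 | 73 |
| Completion rate (%) | 71 | 2 |
| Universities | 144 | 13 |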
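
The completion rate quoted above can be checked with a few lines of arithmetic. The sketch below uses only the three project counts from the OSPP report; the helper name `rate` is ours, and the two extra ratios (completed over selected, selected over published) are derived here for comparison rather than taken from the report.

```python
# Project counts quoted from the OSPP community report above.
published = 593   # projects published
selected = 504    # projects with a selected student
completed = 421   # projects completed


def rate(part: int, whole: int) -> float:
    """Return part / whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The report's 71% figure corresponds to completed / published.
completion_vs_published = rate(completed, published)
# Derived for comparison: share of selected projects that were completed.
completion_vs_selected = rate(completed, selected)
# Derived for comparison: share of published projects that found a student.
selection_rate = rate(selected, published)

print(f"completed / published: {completion_vs_published}%")
print(f"completed / selected:  {completion_vs_selected}%")
print(f"selected / published:  {selection_rate}%")
```

Measured against selected projects instead of published ones, the completion rate is noticeably higher, so it is worth stating which denominator a quoted rate uses.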

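Apart from a few operating-system-kernel communities that use their own Git repositories, most of the completed projects are hosted on code hosting platforms such as GitHub (298 projects) and Gitee (112 projects). The overall platform distribution is as follows:

Looking at the universities of the students behind them, the 421 completed projects were finished by students from 144 universities. Beijing University of Posts and Telecommunications, Zhejiang University, and Huazhong University of Science and Technology lead with more than 20 students each. The detailed distribution is as follows: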

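Annual Contribution Analysis

Beyond the statistics above, we also want to offer deeper insights, such as how much each student from each university actually contributed to their community. This finer-grained analysis helps us see how far students collaborated with the project throughout the program, rather than only whether they completed a specific task.

Note: Because of limits in OpenDigger's current underlying data, the analysis below covers only data from the GitHub and Gitee platforms.

Using contribution data for the whole of 2023 and the community OpenRank algorithm, we analyzed in detail the participation of the students in each community. The top 20 universities by total contribution are shown in the table below: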

OSPP 2023 University Contribution Ranking

| # | University | OpenRank | Change (increase/decrease) | Participating students | Change (increase/decrease) | OpenRank per student | Change (increase/decrease) |
|---|---|---|---|---|---|---|---|
| 1 | Huazhong University of Science and Technology | 67.3 | 43.57 | 21 | 3 | 3.21 | 1.89 |
| 2 | Zhejiang University | 61.23 | 16.62 | 23 | 9 | 2.66 | 2.9 |
| 3 | Beijing University of Posts and Telecommunications | 60.19 | 35.17 | 27 | 5 | 2.23 | 0.75 |
| 4 | Xidian University | 60.05 | 37.86 | 13 | 4 | 4.62 | 2.15 |
| 5 | Fudan University | 59.7 | 7.51 | 4 | 8 | 14.93 | 10.58 |
| 6 | Xi'an University of Posts and Telecommunications | 55.67 | 24.09 | 10 | 3 | 5.57 | 3.14 |
| 7 | East China Normal University | 54.15 | 19.2 | 13 | 2 | 4.17 | 2.5 |
| 8 | University of Electronic Science and Technology of China | 50.6 | 35.74 | 14 | 8 | 3.62 | 1.14 |
| 9 | Chongqing University of Posts and Telecommunications | 48.92 | 24.29 | 5 | 3 | 9.78 | 2.53 |
| 10 | Shanghai Jiao Tong University | 48.34 | 40.83 | 6 | - | 8.06 | 6.8 |
| 11 | Hangzhou Dianzi University | 41.99 | 34.6 | 11 | 8 | 3.82 | 1.35 |
| 12 | Longdong University | 39.48 | new | 1 | new | 39.48 | new |
| 13 | University of Chinese Academy of Sciences | 37.36 | 23.15 | 18 | 10 | 2.08 | 0.3 |
| 14 | Nanjing University | 33.9 | 32.41 | 17 | 15 | 1.99 | 1.25 |
| 15 | Tongji University | 21.35 | 15.98 | 6 | 4 | 3.56 | 0.87 |
| 16 | Wuhan University | 19.02 | 11.33 | 1 | 3 | 19.02 | 17.09 |
| 17 | Southeast University | 18.57 | 8.54 | 8 | 3 | 2.32 | 0.32 |
| 18 | Beijing University of Technology | 18.52 | 18.52 | 3 | 2 | 6.17 | 6.17 |
| 19 | Chengdu University of Information Technology | 18.11 | new | 1 | new | 18.11 | new |
| 20 | Fuzhou University | 16.21 | 8.01 | 5 | 4 | 3.24 | 20.98 |

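Alongside each university's total contribution, we also report its per-student OpenRank. Huazhong University of Science and Technology, Zhejiang University, and Beijing University of Posts and Telecommunications still hold the top three places thanks to their large student numbers. Some universities, however, made the list because of a very high per-student OpenRank, such as Fudan University, Longdong University, Wuhan University, and Chengdu University of Information Technology. They have few participating students, but the high contributions of individual students lifted their final ranking.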
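
The difference between ranking by total and ranking by per-student OpenRank can be reproduced from a handful of rows of the table. This is a minimal sketch: the `University` type and `per_student` helper are our own names, the figures are copied from the ranking above, and the computed averages may differ from the published column in the last digit because the published totals are already rounded.

```python
from typing import NamedTuple


class University(NamedTuple):
    name: str
    total_openrank: float
    students: int


# A few rows copied from the university ranking table above.
rows = [
    University("Huazhong University of Science and Technology", 67.3, 21),
    University("Zhejiang University", 61.23, 23),
    University("Fudan University", 59.7, 4),
    University("Longdong University", 39.48, 1),
    University("Wuhan University", 19.02, 1),
]


def per_student(u: University) -> float:
    """Average OpenRank per participating student."""
    return u.total_openrank / u.students


# Ranking by total favors universities with many students.
by_total = sorted(rows, key=lambda u: u.total_openrank, reverse=True)
# Ranking by average surfaces universities carried by one or two students.
by_per_student = sorted(rows, key=per_student, reverse=True)

for u in by_per_student:
    print(f"{u.name}: total={u.total_openrank}, per student={per_student(u):.2f}")
```

With these rows, the university with the highest total falls toward the bottom of the per-student ordering, while the single-student universities move to the top.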

To look more closely at student contributions, we also ranked student contributors by OpenRank. The top 20 students are as follows:

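OSPP 2023 Student Contribution Ranking

| # | Student | OpenRank | University | Community | Active months |
|---|---|---|---|---|---|
| 1 | Wang ** | 50.361 | Fudan University | Apache HugeGraph | 16 |
| 2 | Pan ** | 44.955 | Shanghai Jiao Tong University | MatrixOne | 19 |
| 3 | Ji ** | 39.475 | Longdong University | Spring Cloud Alibaba | 19 |
| 4 | Meng ** | 34.52 | Chongqing University of Posts and Telecommunications | Apache SkyWalking | 18 |
| 5 | Liu ** | 25.838 | Xidian University | OpenMessaging | 10 |
| 6 | Wang ** | 25.15 | University of Electronic Science and Technology of China | MegEngine (旷视天元) | 13 |
| 7 | Tan ** | 24.831 | Huazhong University of Science and Technology | GraphScope | 12 |
| 8 | Zhang ** | 19.65 | Xidian University | 泰晓科技 (TinyLab) | 9 |
| 9 | Qiao * | 19.016 | Wuhan University | Apache RocketMQ community | 14 |
| 10 | Zhou ** | 18.924 | University of Chinese Academy of Sciences | openEuler community | 9 |
| 11 | Huang ** | 18.115 | Chengdu University of Information Technology | CubeFS | 15 |
| 12 | Zhu ** | 17.194 | East China Normal University | OpenDigger | 14 |
| 13 | Ying ** | 16.561 | Hangzhou Dianzi University | Volcano community | 10 |
| 14 | Li ** | 14.307 | East China Normal University | OpenDigger | 14 |
| 15 | Cong ** | 14.045 | Shandong University | Apache HugeGraph | 12 |
| 16 | Xu * | 13.995 | East China University of Science and Technology | Apache Kvrocks (Incubating) | 8 |
| 17 | Liu * | 13.865 | Huazhong University of Science and Technology | Apache HugeGraph | 16 |
| 18 | Chen ** | 13.452 | Zhejiang University | Curve | 6 |
| 19 | Zhang ** | 12.606 | Xi'an University of Posts and Telecommunications | Linux内核之旅 open-source community | 16 |
| 20 | Lan ** | 12.581 | Sichuan University | DLRover | 8 |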

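Analyzing individual students makes the very top contributors clearly visible: for example, Ji from Longdong University in the Spring Cloud Alibaba community, Huang from Chengdu University of Information Technology in the CubeFS community, and Qiao from Wuhan University in the Apache RocketMQ community. Each of them single-handedly pulled their university's total contribution into the top 20.

The table above also gives the number of months, from January 2023 to July 2024, in which each student was active in their project. All of the top 20 students were active for at least 6 months, and the students mentioned above all contributed for more than 12 months, which reflects OpenRank's emphasis on rewarding long-term contribution.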
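
As a rough illustration of how an "active months" figure can be counted, the sketch below enumerates the months in the January 2023 to July 2024 window and counts those with nonzero activity. It rests on assumptions: the source does not say how OpenDigger defines an active month, so "activity greater than zero" is our stand-in, and the `sample` data is hypothetical.

```python
from datetime import date


def months_in_window(start: date, end: date) -> list[str]:
    """List every calendar month from start to end (inclusive) as 'YYYY-MM'."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def active_months(monthly_activity: dict[str, float], window: list[str]) -> int:
    """Count months in the window whose recorded activity is above zero.

    Assumption: 'active' means any nonzero activity in that month.
    """
    return sum(1 for m in window if monthly_activity.get(m, 0) > 0)


# The observation window stated in the text: January 2023 to July 2024.
window = months_in_window(date(2023, 1, 1), date(2024, 7, 1))

# Hypothetical per-month activity values for one contributor.
sample = {
    "2023-07": 3.1,
    "2023-08": 5.4,
    "2023-09": 2.0,
    "2024-01": 0.0,  # inside the window, but no activity
    "2024-08": 1.2,  # outside the window, ignored
}

print(len(window))
print(active_months(sample, window))
```

The window spans 19 calendar months, which matches the largest value in the active-months column of the 2023 table.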

For comparison, here are the top 20 student contributors for 2022:

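OSPP 2022 Student Contribution Ranking

| # | Student | OpenRank | University | Community | Active months |
|---|---|---|---|---|---|
| 1 | Tang ** | 42.181 | East China Normal University | Apache ECharts | 29 |
| 2 | Cheng * | 40.912 | Zhejiang University | Karmada | 23 |
| 3 | Yang * | 35.699 | Communication University of China | Element Plus | 22 |
| 4 | Zhu ** | 31.264 | Northeastern University (China) | Apache Dubbo | 23 |
| 5 | Rong * | 25.844 | Baise University | Apache APISIX | 27 |
| 6 | Huang ** | 24.218 | Fuzhou University | Apache RocketMQ community | 12 |
| 7 | Meng ** | 24.177 | Chongqing University of Posts and Telecommunications | Apache Pulsar | 30 |
| 8 | Song ** | 22.948 | Fudan University | Apache SkyWalking | 27 |
| 9 | Chen * | 19.426 | Beijing University of Posts and Telecommunications | Milvus | 25 |
| 10 | Fan ** | 16.426 | University College London, University of London | Apache Pulsar | 8 |
| 11 | Zhang ** | 14.617 | East China Normal University | DevLake | 17 |
| 12 | Zhao ** | 13.8 | Beijing University of Posts and Telecommunications | OpenMLDB | 5 |
| 13 | Yang * | 13.085 | Xi'an University of Posts and Telecommunications | Curve | 18 |
| 14 | Cui ** | 12.279 | Guilin University of Electronic Technology | MegEngine (旷视天元) | 28 |
| 15 | Ye ** | 11.502 | College of William and Mary | Alluxio | 6 |
| 16 | Han ** | 9.98 | Beijing University of Posts and Telecommunications | KubeVela | 15 |
| 17 | Zhang ** | 9.443 | College of Science and Technology, Hunan University of Technology | Apache DolphinScheduler | 9 |
| 18 | Yang ** | 9.157 | China Institute of Atomic Energy | Jina AI | 10 |
| 19 | Wu ** | 9.077 | Zhejiang University | Linux内核之旅 open-source community | 9 |
| 20 | Wu ** | 8.831 | New York University | Hypercrx | 30 |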

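Follow-up Analysis of Continued Contribution

As we have seen, OSPP has drawn many outstanding university students into deep participation in open-source communities while they are still at school. How active do these students remain afterwards? To find out, we tracked them over a longer period to see how many keep contributing to their communities after OSPP ends.

The figure above shows how the contributions of all students who completed projects changed from January 2022 to July 2024. Although September is a contribution peak every year, their contribution across all of open source remained relatively stable. This suggests that, beyond OSPP, the students went on to contribute to other projects in the open-source world, and that OSPP opened a door to that world for them.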

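Student Global Contribution Ranking

| Student | OpenRank | University | Projects |
|---|---|---|---|
| Yang * | 315.068 | Communication University of China | YunLeFun/status, YunYouJun/valaxy, element-plus/element-plus |
| Ji ** | 148.622 | Longdong University | alibaba/spring-cloud-alibaba, spring-cloud-alibaba-group/spring-cloud-alibaba-group.github.io, apache/hertzbeat |
| Liu ** | 136.224 | Hangzhou Dianzi University | iyear/tdl, iyear/pure-live-core, devstream-io/devstream |
| Tang ** | 132.826 | East China Normal University | hypertrons/hypertrons-crx, X-lab2017/open-wonderland, X-lab2017/open-research |
| Zheng ** | 132.375 | Zhejiang University | eunomia-bpf/eunomia-bpf, eunomia-bpf/bpftime, eunomia-bpf/bpf-developer-tutorial |
| Liu ** | 107.148 | University of Electronic Science and Technology of China | SciSharp/LLamaSharp, SciSharp/TensorFlow.NET, Oneflow-Inc/oneflow |
| Rong * | 91.659 | Baise University | apache/apisix-ingress-controller, apache/apisix, apache/apisix-helm-chart |
| Cui ** | 89.637 | Guilin University of Electronic Technology | PaddlePaddle/Paddle, PaddlePaddle/PaddleSeg, openvinotoolkit/openvino |
| Zuo * | 89.047 | Harbin Medical University | Well2333/nonebot-plugin-bilichat, djkcyl/BBot-Graia, IceTiki/ruoli-sign-optimization |
| Lin ** | 88.883 | East China Jiaotong University | Undertone0809/promptulate, PKUFlyingPig/cs-self-learning, langchain-ai/langchain |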

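Besides their OSPP communities, many students also contributed heavily to other open-source communities. The two students from Longdong University and Baise University, by contrast, stayed involved over the long term in the communities of their OSPP projects, becoming stable contributors and even committers.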

Redis Changes Its Open-Source License! Are Cloud Vendors Really Freeloading on the Open-Source Community?

· 11 min read
Frank Zhao
Ph.D candidate at X-lab, author of OpenDigger

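How It Started

On March 21, 2024, Rowan Trollope, CEO of Redis, the company behind the well-known open-source key-value database Redis, announced a change to the project's license, from the original BSD open-source license to a dual license of RSALv2 and SSPLv1.

The change is mainly intended to protect Redis's own commercial interests by preventing cloud vendors from using the open-source version for free to offer paid Redis SaaS services. Such moves are not uncommon: companies such as Confluent, MongoDB, and Elastic have made similar license changes to their open-source projects to protect their interests. This time, however, Redis's move angered many developers, in large part because the Redis community includes many external contributors, and a unilateral license change clearly damages the community and hurts those contributors.

So who has actually been contributing deeply to the Redis community?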

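Digging Deeper

As the figure below shows, judging by the contributions of the top ten developers by OpenRank in the Redis community each year since 2020, the community has been steadily diversifying. The share contributed by Redis's in-house developers has fallen year by year from nearly 80% in 2020. By the first quarter of 2024, Redis employees accounted for less than 40% of the contributions among the top ten developers. AWS, Alibaba Cloud, Tencent Cloud, Ericsson, and many other vendors have been deeply involved in the Redis community for years, and their contributions have kept growing year by year.

At the end of June 2020, Redis's original author Salvatore Sanfilippo (@antirez) announced in a blog post that he was stepping back from day-to-day maintenance of the Redis community, handing it over to Yossi Gottlieb (@yossigo) and Oran Agra (@oranagra) of what was then still called RedisLabs. At the same time, the two announced a new community governance model and, together with Itamar Haber (@itamarhaber), formed the first core development team of the Redis community. The following month, Madelyn Olson (@madolson) of AWS and Zhao Zhao (@soloestoy) of Alibaba Cloud joined the core team. This five-person group remained the stable core team of the Redis community until the license change.

Besides the six core developers mentioned above, Binbin Zhu (@enjoy-binbin) of Tencent Cloud joined Tencent Cloud's database product department as a result of his long-term involvement in Redis, and Alibaba Cloud has three more developers besides Zhao Zhao who appear in the yearly top-ten contribution lists. Overall, nearly 20 people from cloud vendors such as AWS, Alibaba Cloud, Google, and Tencent Cloud currently contribute to the Redis community on a regular basis. The cloud vendors' investment in the Redis community is plain to see, which is at odds with the popular impression that cloud vendors freeload on open-source communities.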

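The Split

It is precisely because so many cloud-vendor contributors were involved that, right after Redis announced the license change, AWS's Madelyn Olson started Valkey, a fork of Redis, with plans to host it at the Linux Foundation. Google and Ericsson have both stated clearly that they will invest in supporting the Valkey community.

Developers at other cloud vendors have almost no choice but to move to Valkey, because the new Redis license's exclusion of cloud vendors leaves them no way to keep contributing to the Redis community. Redis also does not seem to intend to keep the community deeply involved in future development. According to several Chinese committers, the permissions of the redis-committers team on GitHub were revoked within a week, removing external committers' write access to the repository and their permission to manage issues and PRs. Their permissions in the Redis project are now exactly the same as an ordinary user's.

"Besides contributing specific features to the Redis community, we also contribute back the fixes and improvements in functionality, performance, stability, and observability that we have accumulated in our cloud products, and the large user base of our cloud products passes many real requirements from production scenarios to the upstream community. We believe this is our responsibility, and we believe a thriving open-source community is worth maintaining," said Zhao Zhao of Alibaba Cloud.

According to the data, among the top ten Redis contributors in 2024, apart from two engineers at the Redis company, seven of the remaining eight have already joined the development of Valkey. In effect, Valkey has become the new community and is operating normally, while the Redis company's engineers will go on to develop and maintain Redis on their own.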

Data in the figure above was updated in January 2025.

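At the macro level, the Redis community's OpenRank (collaboration influence) has held at around 80 over the past six months, while Valkey, open-sourced for only ten days in March, has already soared to an OpenRank of around 40, half that of Redis. In terms of participating developers, Redis has stayed at roughly 100 people per month. In March, many developers joined discussions in the Redis community because of the license change, which doubled participation to 220, while Valkey reached 146 participants in its first ten days, already above Redis's usual level.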
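
The comparisons in the paragraph above reduce to three simple ratios. The sketch below only restates the approximate figures quoted in the text; the variable names are ours, and the inputs are the rounded values from the paragraph rather than exact OpenDigger data.

```python
# Approximate figures quoted in the paragraph above (March 2024).
redis_openrank = 80             # Redis monthly OpenRank, roughly stable for six months
valkey_openrank = 40            # Valkey OpenRank after about ten days
redis_usual_participants = 100  # typical monthly participants in Redis
redis_march_participants = 220  # Redis participants in March, during the license debate
valkey_participants = 146       # Valkey participants in its first ten days

# Valkey's OpenRank relative to Redis.
openrank_ratio = valkey_openrank / redis_openrank
# How much the license debate inflated Redis's own participation.
redis_march_spike = redis_march_participants / redis_usual_participants
# Valkey's early participation relative to Redis's usual level.
valkey_vs_redis_usual = valkey_participants / redis_usual_participants

print(f"Valkey / Redis OpenRank: {openrank_ratio:.2f}")
print(f"Redis March participants vs. usual: {redis_march_spike:.2f}x")
print(f"Valkey participants vs. Redis usual: {valkey_vs_redis_usual:.2f}x")
```

Note that Valkey's 146 participants exceed Redis's usual level but not Redis's March figure, which was itself inflated by the license discussion.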

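Overall, the split of the Redis community can no longer be reversed. With Valkey donated to the Linux Foundation, we expect more open-source developers to join in contributing to and developing Valkey.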

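Ripples

As the idea behind the OpenRank algorithm holds, everything in this world is connected and mutually influential: an event affects not only the party involved but also everything related to it. In the 2023 China Open Source Annual Report, for example, we found from the data that Unity's change to its pricing policy in September 2023 directly gave the open-source game engine Godot (godotengine) its largest wave of growth since it was open-sourced. Of the 80,000 stars of this project, which has been open source for more than a decade, over 10,000 came in September 2023. Game developers answered Unity's decision with their support for open source.

Besides producing a new fork community, Valkey, the Redis license change has also led many developers who need a key-value database to look at other open-source projects related to Redis. Kvrocks, under the Apache Software Foundation, is one of them. Unlike Redis, which is an in-memory key-value database, Kvrocks is a disk-based one. As the figure below shows, all of Kvrocks's metrics rose noticeably in March. This may also be because it is a foundation project: in an era when the company behind an open-source project can bypass the established community rules and unilaterally change the license at any time, projects hosted at a foundation give developers a greater sense of security.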

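Conclusion

Open-source developers have complained for years about cloud vendors bleeding open-source projects dry, but things are quietly changing. More cloud vendors have come to recognize the importance of community and are willing to invest people and even material resources in the open-source communities they depend on, so that their cloud services can evolve in step with upstream.

In the future, we believe the upstream and downstream of open-source communities can collaborate better and reach outcomes in which all sides win. Effective evaluation of open-source contribution and influence is a prerequisite for healthier and more effective collaboration: identify the developers who truly contribute, and let what they create also belong to them.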

Awesome OpenRank

· One min read
Will Wang
Prof. @ ECNU / Founder of X-lab