CodeCompass: Open Source AI for Personalized GitHub Discovery

20 Aug 2024

Open Healthcare Network and Copilot

Open Healthcare Network is an open-source project that connects hospitals with care centers and helps track patient journeys. (1m38s)
The project, built with contributions from over 400 people worldwide, aims to address the shortage of healthcare professionals in India. (2m38s)
Copilot has significantly improved the quality of code in the project, acting as a personal assistant for developers. (2m52s)

CodeCompass can generate recommendations for new users in about a minute. It also takes about a minute for the Streamlit app to load all of the data. (14m19s)
The chatbot component of CodeCompass allows users to interact with repositories, extract file structures, get file contents, view branches and commit histories, and search repositories and commits by keywords. (15m15s)
The chatbot can provide summaries of code within specific files, even if the user is unfamiliar with the programming language. (16m59s)
CodeCompass makes personalized recommendations to improve the developer experience, especially for people who are new to open source and overwhelmed by the size of platforms like GitHub. (47m0s)
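
The repository operations listed for the chatbot map onto public GitHub REST API endpoints. The sketch below only builds the request URLs for those endpoints and extracts file paths from a tree response; it is not the CodeCompass implementation. The function names and the default `main` branch are assumptions made for illustration.

```python
from urllib.parse import quote, urlencode

API_ROOT = "https://api.github.com"


def tree_url(owner, repo, ref="main"):
    """URL for the recursive file tree of a branch, tag, or commit."""
    return f"{API_ROOT}/repos/{owner}/{repo}/git/trees/{quote(ref, safe='')}?recursive=1"


def contents_url(owner, repo, path, ref=None):
    """URL for the contents of a single file or directory."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/contents/{quote(path)}"
    return f"{url}?{urlencode({'ref': ref})}" if ref else url


def branches_url(owner, repo):
    """URL listing the branches of a repository."""
    return f"{API_ROOT}/repos/{owner}/{repo}/branches"


def commits_url(owner, repo, branch=None, per_page=30):
    """URL listing commits, optionally restricted to one branch."""
    params = {"per_page": per_page}
    if branch:
        params["sha"] = branch
    return f"{API_ROOT}/repos/{owner}/{repo}/commits?{urlencode(params)}"


def search_repos_url(keywords):
    """URL for a keyword search over repositories."""
    return f"{API_ROOT}/search/repositories?{urlencode({'q': keywords})}"


def search_commits_url(keywords):
    """URL for a keyword search over commits."""
    return f"{API_ROOT}/search/commits?{urlencode({'q': keywords})}"


def list_paths(tree_response):
    """Return the file paths (blobs only) from a parsed git-trees response."""
    return [item["path"] for item in tree_response.get("tree", []) if item.get("type") == "blob"]
```

Keeping URL construction separate from the HTTP call makes each operation easy to check without network access; a real client would also need authentication and rate-limit handling.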

Gabriel Deel is a student at IE University in Madrid and worked as a project manager and data engineer on the CodeCompass project. (8m42s)
M. Helen Hofland is a Norwegian student at IE University who worked as a data engineer on the project. (9m17s)
Luca, a Peruvian student at IE University, contributed to the data engineering team and assumed a project lead role, focusing on code quality and documentation. (9m41s)
Ky Soloman, from Georgia, worked as a data scientist and MLOps engineer on the project. (10m17s)
Miranda Germond, of English and Italian descent, took on multiple roles including data scientist, MLOps, and data engineering. (10m49s)

The project uses a large dataset of GitHub information, larger than a comparable dataset found on Kaggle. (22m6s)
The dataset was created by querying the GitHub API for users with at least 1,000 followers and 10 repositories. (22m35s)
The data collected includes user information, repositories, and repositories they have starred, with a limit of 10 repositories per user. (23m36s)
The project initially used Google Cloud to store and manage CSV files containing generated data. However, as the data grew, uploading and downloading these files became problematic. (25m21s)
To address the data management challenges, the team explored using Redis. A branch named "redis 2" was created to implement a primary database in Redis. (25m43s)
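
The user selection described above (at least 1,000 followers and 10 repositories) can be expressed with the `followers` and `repos` qualifiers of GitHub's user search endpoint. The following is a minimal sketch under that assumption; the team's actual query is not shown in the source. The function names, the record layout, and applying the 10-item cap to starred repositories as well as owned ones are all assumptions.

```python
from urllib.parse import urlencode

SEARCH_USERS = "https://api.github.com/search/users"


def user_search_url(min_followers=1000, min_repos=10, page=1, per_page=100):
    """Build a user-search URL using the followers and repos qualifiers."""
    query = f"followers:>={min_followers} repos:>={min_repos}"
    return f"{SEARCH_USERS}?{urlencode({'q': query, 'per_page': per_page, 'page': page})}"


def build_user_record(profile, repos, starred, limit=10):
    """Assemble one dataset row, keeping at most `limit` repositories per list.

    `profile`, `repos`, and `starred` are parsed JSON objects as returned by
    the GitHub API; the output layout is hypothetical.
    """
    return {
        "login": profile["login"],
        "followers": profile.get("followers"),
        "repos": [r["full_name"] for r in repos[:limit]],
        "starred": [r["full_name"] for r in starred[:limit]],
    }
```

For example, `user_search_url(page=2)` requests the second page of 100 matching users, and `build_user_record` trims each user's repository lists before the row is written out.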

The team considered long- and short-term user representation (LST) as an alternative algorithm. Because the dataset lacks timestamped user interaction data, this option was deemed unsuitable for the time being. (30m16s)
The developers chose to use CSV files instead of JSON files because they found them easier to work with for the initial implementation of the project. (32m10s)
The developers tried both GPT-3.5 and GPT-4, but found that GPT-3.5 did not provide the depth and detail they were looking for. (33m22s)
The developers implemented Llama 3, an open-source language model, as part of their project. (34m32s)
The CodeCompass system uses OpenAI's Assistants API, specifically the GPT-4 model, to process user queries and interact with the GitHub API. (36m17s)
The system can handle both general knowledge questions and requests related to specific GitHub repositories, such as retrieving repository structure or content. (37m0s)
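
One common way to let an assistant reach an external API is function calling: the assistant is given a JSON schema for each tool, and the application runs the matching function whenever the model asks for it. The sketch below shows a tool definition in the function-tool shape and a local dispatcher. It makes no OpenAI or GitHub calls, and the tool name `get_repo_structure`, the handler table, and the placeholder return value are hypothetical rather than taken from CodeCompass.

```python
import json

# Function-tool definition: a name, a description, and JSON Schema parameters.
GET_REPO_STRUCTURE_TOOL = {
    "type": "function",
    "function": {
        "name": "get_repo_structure",
        "description": "List the file paths in a GitHub repository.",
        "parameters": {
            "type": "object",
            "properties": {
                "owner": {"type": "string", "description": "Repository owner"},
                "repo": {"type": "string", "description": "Repository name"},
            },
            "required": ["owner", "repo"],
        },
    },
}


def get_repo_structure(owner, repo):
    """Placeholder handler; a real one would query the GitHub API."""
    return {"owner": owner, "repo": repo, "paths": []}


# Maps tool names requested by the model to local Python callables.
HANDLERS = {"get_repo_structure": get_repo_structure}


def dispatch_tool_call(name, arguments_json):
    """Run the handler for a tool call and return its result as a JSON string.

    The model supplies arguments as a JSON-encoded string, so they are
    decoded before being passed to the handler as keyword arguments.
    """
    handler = HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(arguments_json)
    return json.dumps(handler(**args))
```

General knowledge questions need no tool call, so the model answers them directly; only repository-specific requests would pass through a dispatcher like this one.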

Future improvements include integrating additional models and tooling such as Gemini and LangChain, allowing users to choose between different models, and hosting the system with a robust database like M's database for wider accessibility and feature implementation. (39m0s)
Potential improvements to the project include hosting it and implementing a pipeline for continuous data scraping and comparison. This pipeline would track user numbers, repository presence in the database, and facilitate model fine-tuning. (41m32s)
To enhance data loading and generation, there are plans to explore in-memory and open-source databases like Redis. This would involve directly querying the database and potentially using Redis Enterprise for enhanced value and recommendation speed. (42m32s)
Future improvements also encompass adding compatibility for private repositories and exploring integration with platforms beyond GitHub to create a cross-platform recommender. (43m0s)
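
The comparison step of the proposed scraping pipeline can be pictured as a diff between what is already stored and what a fresh scrape returns. The sketch below is one possible shape for that step, assuming both sides are available as a mapping from user login to a collection of repository names; the function name and report fields are hypothetical.

```python
def diff_snapshot(stored, scraped):
    """Compare a stored dataset with a fresh scrape.

    Both arguments map a user login to an iterable of repository names.
    Returns user counts, users not yet stored, and repositories that appear
    in the scrape but are missing from the stored data.
    """
    new_users = sorted(scraped.keys() - stored.keys())
    missing_repos = {}
    for user, repos in scraped.items():
        missing = set(repos) - set(stored.get(user, ()))
        if missing:
            missing_repos[user] = sorted(missing)
    return {
        "stored_users": len(stored),
        "scraped_users": len(scraped),
        "new_users": new_users,
        "missing_repos": missing_repos,
    }
```

A report like this tells the pipeline which rows to append, and a steadily growing `missing_repos` entry could serve as a signal that the recommendation model is due for fine-tuning.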

It is recommended to open an issue to discuss potential improvements with the team before submitting a pull request. (46m20s)

Miguel, who guided the project, believes that CodeCompass is impactful enough to be integrated into a real organization and encourages the creators to connect with GitHub for potential integration. (50m26s)
CodeCompass is a fantastic project, and the team behind it should be proud of their accomplishment in such a short time. (57m58s)

Gabriel's advice for learning is to build something useful, even if it's just for personal use. (54m30s)
Kitty emphasizes the importance of starting from scratch and iteratively building upon the project, prioritizing progress over perfection. (55m9s)
Miranda encourages embracing failure as a learning opportunity and seeking help when needed. (55m33s)
Luca suggests starting with a small project and gradually scaling it up, incorporating testing and modularity along the way. (56m12s)
Mod advises not to be afraid of being a beginner, as everyone starts somewhere, and emphasizes the importance of trying. (56m44s)
People should try new things in the tech industry, even if they consider themselves advanced, as there is always something new to learn. (57m30s)