Spring 2023
Smith College
To form project teams and start planning.
The creator of each project idea has the option to speak for ~2 mins to explain their idea.
You are selling the idea to potential teammates.
Only ~8 of the top 10 projects will form valid teams of 3–4 members.
Things to cover:
My package will help users budget better. To solve this problem, users will input their monthly income and how much money they would like to spend. The package will calculate how much the user should spend on housing, food, transportation, entertainment, healthcare, savings, and other expenses. The program will also provide users with lists of recipes to help them stay on budget, and will let them compare the costs of different transportation options. Personally, I am interested in this project because I love working with numbers. The output will be a detailed budget plan with tips for staying on budget.
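The allocation step described above could start as a simple percentage split. Below is a minimal R sketch; the function name and the category shares are illustrative assumptions, not part of the pitch.

```r
# Hypothetical sketch: split a monthly income across budget categories.
# The share values here are placeholders for illustration only.
budget_plan <- function(monthly_income,
                        shares = c(housing = 0.30, food = 0.15,
                                   transportation = 0.10, entertainment = 0.05,
                                   healthcare = 0.10, savings = 0.20,
                                   other = 0.10)) {
  # Guard against shares that do not sum to the whole income
  stopifnot(abs(sum(shares) - 1) < 1e-8)
  round(monthly_income * shares, 2)
}

budget_plan(3000)
```

A real version would let the user override the shares and would attach the recipe and transportation suggestions to each category.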
The package would generate a travel itinerary including transportation and housing, based on input from a user about their destination, days of travel, budget, a genre of activities they enjoy, and how fast-paced they’d like their schedule to be.
The R package would use data from various sources (e.g., travel websites, tourism boards, etc., so web scraping would be involved) to generate a list of potential travel options that match the user’s preferences. I imagine this would also involve a lot of regex to parse through all the site data.
I’ve wasted a lot of time planning trips before, even short-term ones. There are so many moving parts and complications that arise at any second, so some kind of package to help assist with that seemed like a useful idea. I think developing it would exercise a lot of the skills we’ve learned in class and help me learn new ones, even if the final package isn’t one a user would execute for data science purposes.
As for the end result of the package: users could then select the generated options they like best, and the package would generate a full itinerary, complete with daily activities, travel times, etc., displayed in a clean, easy-to-understand manner. There would then be an option to export the itinerary (PDF? CSV?) so that it could easily be shared with other people.
Web Scraping Archive.org to Study the Environmental History of the Global South
I believe it will take not only scientists but also humanists working together to better understand and ultimately transform humanity’s relationship with the rest of nature. That is why I am an environmental historian of the Middle East, and that is why I want to leverage our data science skills to improve our understanding of humanity’s experiences with past global climate crises.
The project I have in mind will allow people to easily explore past human experiences of major climatic events with an R package for creating machine-readable text corpora out of public domain publications from www.archive.org. Archive.org has an amazing collection of historical publications, but the problem is that it is time-consuming to search each individual document. So the goal would be to create an R package that makes it easy to pick a historical publication, search its full text for the dates of keywords like flood, drought, and famine, and then produce tidy tables and visualizations to explore the results. This would also be useful for text analyses. The test case that I have in mind is a global publication from the 1800s and early 1900s that brings together observations from across East Asia, Southeast Asia, South Asia, the Middle East, and Africa.
The skills I anticipate using are web scraping, data wrangling, cleaning, and visualization. The ultimate goal is to render the text of historical publications in Archive.org searchable and visualizable for researchers interested in climate science, political ecology, and environmental history (but the corpora could have many other uses as well!). The primary output would be a package that can produce a machine-readable corpus. Depending on how that goes, secondary outputs would be using text analysis to generate visualizations and data for keywords that we could pair with historic climate data.
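The keyword-search step described above could be sketched as follows in base R. This assumes the publication's text has already been scraped into a character vector `pages` (one element per page); the function name and output shape are illustrative, not the package's actual design.

```r
# Hypothetical sketch: count keyword occurrences per page and return a
# tidy data frame (one row per page-keyword pair), ready for ggplot2.
count_keywords <- function(pages, keywords = c("flood", "drought", "famine")) {
  count_one <- function(page, k) {
    m <- gregexpr(k, tolower(page), fixed = TRUE)[[1]]
    if (m[1] == -1) 0L else length(m)  # gregexpr returns -1 on no match
  }
  do.call(rbind, lapply(seq_along(pages), function(i) {
    data.frame(
      page    = i,
      keyword = keywords,
      hits    = vapply(keywords, function(k) count_one(pages[i], k), integer(1))
    )
  }))
}

pages <- c("A great flood and a drought.", "No rain; drought, drought.")
count_keywords(pages)
```

The tidy output makes it easy to pipe the counts straight into a dplyr summary or a ggplot2 time-series plot once page numbers are mapped to dates.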
An advanced to-do list that lets you set an estimated completion time for each task (starting a timer when the task begins), prioritize tasks by importance using your choice of color palettes, and check off tasks as they are completed.
I believe someone working on this project would need strong organizational skills and some experience with website design or art.
I am interested in this project because I have never been able to find a to-do list that motivates me to do my work. I have tried Amazon to-do lists and spreadsheets, but I would use them for a few weeks and never touch them again. This to-do list has a lot of features that ensure that people stay on top of their work.
Given a user’s Discover Weekly playlist, the package organizes songs by their “category ID” given by the Spotify API.
This package will organize songs for listeners who do not know how to organize their listening habits. It also will give listeners insight into what Spotify thinks they would like to listen to.
The skill sets needed are data frame manipulation and familiarity with URLs.
I am interested in learning more about listening habits of myself and others.
The output would be a data frame with the songs in the Discover Weekly playlist, along with each song’s category ID, URL, and other features.
The package will generate meal prep ideas from NYT cooking (or any other good cooking website) based on the ingredient lists you add. If the ingredient list is more than 3 items, you will have the opportunity to rank your three most important ingredients or specify the cuisine.
The return will be a list of the top 10 meal ideas (title, link, instructions) ranked by their ratings on the website. The strategy used to solve the problem would be searching the NYT Cooking website using the ingredient lists that the user inputs.
The skill sets needed for this project idea would be web scraping, regex, iteration, etc. As someone interested in cooking and preparing meals, I am excited to build a new R package useful for other home cooks.
A package to search and scrape the syllabus PDFs we input, returning a list of due dates sorted by date and a ggplot showing how many hours each class will take.
My project idea is to create a package that will help students track their major progress. If you input your major, it will output a data frame that has all the requirements needed to graduate with that major. Additionally, there will be a list of all possible classes that could satisfy those requirements, along with whether each is offered in the fall or spring and whether there are any prerequisites. This package would greatly help any student who wishes to organize and plan for their future at Smith. If this is too difficult, it could be scaled down to focus on only one or two majors.
This package would look at the most recent course catalog on the Smith website, so web scraping skills should be used. Data wrangling / cleaning data would also be needed.
The problem my package will solve is having to search hard for details about a professor when deciding whether to take their course.
The package will look through https://www.ratemyprofessors.com/ and a campus website to give information about a professor. The package requires lots of web scraping and the problem solving that comes with it.
I am interested in the project because I use Rate My Professors for every class I take, but want some help deciding whether the professor is good.
The output would give key words about the professor, what they teach, example classes, and their rating.
A movie recommendation package. This package will recommend movies to watch next based on the user’s input of a movie. The movie generator will recommend a list of movies that have a similar genre.
Input: the name of a movie
Output: a list of 5–10 movies the user may like based on the movie entered
Using the Gale–Shapley algorithm, I match people with teams based on the survey results.
I am treating every project as if it had a preference of 0 toward everyone except the project creator.
There is some randomness, but the goal is to find a stable optimal solution so everyone is on a project they enjoy.
Link to GitHub Classroom Final Assignment
We will cover how to start coding collaboratively using GitHub on Monday.
What is the absolute minimum this package needs to do to “work”?
From the syllabus:
You will need to work with your group outside of class to get this project done.
We will have a fair amount of in-class time, but to make the most of that time (getting quick answers to coding or planning stumbles) you need to come prepared.
Adv Git
SDS 270: Advanced Programming for Data Science