This long journey has come to an end… almost. Although this is the end of the blog, I have some extra work to do in terms of finishing this project. But I wanted to use this final blog post to go over this journey, what I learned, and what’s next.
From an academic standpoint, I learned that a great deal of the battle in machine learning is the data. In my project, the data never seemed to be in the right form or even readily available, and it was the very thing the algorithm needed in order to become smarter. Additionally, it was an extremely important realization that coding is not well streamlined. I spent a great deal of my time simply trying to connect my programs to one another and ensure they were communicating correctly. It was like bureaucracy, except for coding, which reminded me of the intricacy of large-scale projects.
From a personal standpoint, I believe I really improved my time allocation, and also improved my understanding of the importance of data. The Facebook scandal didn't hit home for me at first because I hadn't realized how valuable data truly is in understanding a person; now I'm definitely more cautious about what I put out there.
In terms of what went right or wrong, I'm not going to go very in depth here (for that you'll need to watch my presentation), but it's fair to say more things went wrong than went right. What went right was the machine learning algorithm training and the variable decisions: the algorithm was fairly easy to train once I had the data, and the variables were fairly easy to decide upon after conversation and research. What went wrong? I'll need a list for that:
- Datasets- There were no clear datasets, so I had to build the dataset myself.
- Scraping- The scraping tool wouldn't work because the servers were locked, so I had to transcribe by hand.
- Environment- The developer environment I wanted to use was not working on my computer for some reason, so I had to switch.
- Programs- There wasn't enough available data for me to recommend programs, so I had to switch to recommending colleges.
- Lists- The algorithm wouldn't recommend a ranked list, only a single answer, so I had to make that switch as well.
I'll explain more at my presentation. As for my final project, it is currently a Python program, but I'm working on connecting it to a well-designed, aesthetically pleasing website. Hopefully that works out great. As for my presentation, I will be presenting on May 23 at 6! As for recommendations for future project students: take your time and hit the ground running! Do most of your research well before the project starts, because slacking on that will lead to a hard twelve weeks.
I want to thank Mr. Lal, Mr. Lizardo, and Ms. Belcher for their help with my project. Their help has been so important to the progress I've made, and I couldn't be more grateful.
Following up from last week, there were some important issues that had to be addressed, and they were. First, the algorithm now takes in all of the data and reads it correctly. There's no longer a problem with the CSV being read and understood, and this has led to a far greater improvement in efficiency when refining the data. Quick plug and chug, so to speak. As for readability, simplifying the jargon, and reducing data input errors to zero, I made the shift from words to numbers. Here are the decisions I made in order to reduce the overhead and reduce the work for the ML algorithms.
- Political Science- 2
- Computer Science- 3
- Pre Med- 10
- Other encoded variables: Sports, Sports Leadership, Related EC, Related EC Leadership, Volunteer, Liberal Arts
This looks longer than it is, but some important things to note:
- I did not use data from all fifty states. This means that this college algorithm is not all-encompassing. Apologies, but I knew coming in that the dataset was not going to cover everything.
- I made some business decisions. Maybe it's the inherent elitism I have as a STEM major, or the lack of data from drama and theater majors, but I grouped them into English.
- This simplified the dataset.
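To show what the words-to-numbers shift looks like in practice, here's a minimal sketch with pandas. The column name and the rows are made up for illustration; the codes shown are the ones from my mapping above, and the other categorical columns get the same treatment.

```python
import pandas as pd

# Codes from my mapping (other majors/variables follow the same pattern).
major_codes = {"Political Science": 2, "Computer Science": 3, "Pre Med": 10}

# Hypothetical column of raw text entries.
df = pd.DataFrame({"Major": ["Computer Science", "Pre Med", "Political Science"]})

# Replace each word with its number so the ML algorithms see integers.
df["Major"] = df["Major"].map(major_codes)

print(df["Major"].tolist())  # [3, 10, 2]
```

One design note: `map` leaves a `NaN` wherever a value isn't in the dictionary, which doubles as a quick check for typos in the raw data.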
Because of college visits, which took most of my time this week, I wasn't able to apply the algorithm to the revised dataset with 25 entries for every college.
Eish the Computer Scientist 🙂
I hope everyone had a terrific Spring Break! In the last two weeks, I made some huge progress on the project that I'm excited to share. Let's get into it. As I've stated numerous times in earlier blog posts, I'm using the Anaconda Navigator workspace to execute my project. So the first step was to put the CSV file on the path so it could be accessed by the Python libraries I was going to run on it. I originally was going to have my libraries read an Excel file, but through research and GitHub, I realized it would require fewer hoops to simply export it as a CSV file. After that was done, it was a matter of importing the required libraries. Here are all of the libraries I imported (I won't go into specific imports):
- scipy- This is a library that extends Python's capabilities for mathematics, science, and engineering. With it I can do comprehensive plotting so I can understand the deviations of each variable, spreads, etc.
- numpy- Like scipy, used for statistical modeling, linear algebra, etc.
- matplotlib- Helps with plotting and graphing my data.
- pandas- This is the most important library I am using. It's the middleman between numpy and the next library I'll note, and it lets me create high-level, efficient data structures.
- sklearn- This library lets me apply machine learning algorithms to my data. It's built on scipy, numpy, and matplotlib, which is why importing those libraries is necessary. I will be importing specific algorithms from this library.
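A minimal sketch of what the import block plus the CSV load looks like. The CSV contents here are a made-up stand-in (the real file would be read with `pd.read_csv("path/to/file.csv")`), and the column names are assumptions, not my actual schema.

```python
import io

import numpy as np
import pandas as pd
import scipy
import matplotlib
import sklearn

# Placeholder CSV text standing in for my exported spreadsheet.
csv_text = "University,GPA,SAT\nUCLA,3.9,1450\nUSC,3.7,1380\n"

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (2, 3)
```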
After importing, I got to the task of ensuring the data was readable and would be imported correctly. That was a fairly easy task, and then I wanted to see the data distribution by university.
This is a problem. The data distribution needs to be fixed, because only four examples for Stanford (STAN) yet 68 for UCLA is a serious imbalance for my program. I will aim to fix this next week by equalizing the examples to about 25-40 each.
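Checking the distribution is a one-liner with pandas. The toy counts below are made up just to show the mechanics; in my real DataFrame the same call is what surfaced the 68-vs-4 imbalance.

```python
import pandas as pd

# Toy stand-in for the admissions DataFrame; column name is an assumption.
df = pd.DataFrame({"University": ["UCLA"] * 5 + ["STAN"] + ["USC"] * 3})

# Count examples per university to spot class imbalance at a glance.
counts = df["University"].value_counts()

print(counts["UCLA"], counts["STAN"])  # 5 1
```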
After I checked this, I moved on because I was curious about the effectiveness of the algorithms on this data. This is where the big error arose. The code to run the algorithms is fairly straightforward-ish, but I kept getting an error that prevented me from seeing the efficiency of the algorithms. The variables X and Y in my program represented the feature columns and the label column, but for some unknown reason the Y variable kept failing while X did not, even though the code was the same. I asked my Computer Science teacher Mrs. Visa for help, and she didn't understand it either. After much tribulation and tinkering, I realized Y was being read as a generic object instead of as an integer. I coded in the small fix, and the results came out. The results were massively impressive and far exceeded expectations.
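A minimal reconstruction of that bug, with a made-up label series rather than my actual code: labels read from a CSV as text get the generic `object` dtype, and one cast fixes it.

```python
import pandas as pd

# Reproduces the symptom: a numeric label column arrives as text strings.
y = pd.Series(["1", "0", "1"])
assert y.dtype == object  # pandas sees generic objects, not integers

# The one-line fix: cast to integers so the algorithms treat Y as class labels.
y = y.astype(int)

print(y.dtype.kind)  # 'i' for integer
```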
All of them, except logistic regression, had 85%+ accuracy. And CART (Classification and Regression Trees) had a 93% success rate! This is great news, and it bodes well for when I fine-tune and equalize my data: possibly 97%+!
Excited for what I’ll be sharing next week!
This weekly update is different from the others because, for once, I won't be talking about data accumulation! This week, I didn't have as much time to work as in other weeks because I am currently in Costa Rica. I frontloaded my hours on Monday and Tuesday, but I still got some ample work done.
The first and most important step was ensuring that the path to the Excel file was accessible to the app I was using to code the machine learning algorithm, the Anaconda Navigator. I'm not too tech savvy for a programmer, so it took me a while to set the file path straight. The problem with my computer is that it was used before by a user named Keith Ramee, and then I changed the name to mine. For some reason, that did not change file names such as KEITHPC and the main program data files; that is, on my screen. In the actual backend and BIOS, the name is actually EISHPC. That took me forever to figure out, because how was I supposed to know? Anyway, the job was done.
Afterwards, I began coding the basic structure of the machine learning algorithm, and that required making some assurances. Probably the most important is making sure no two variables line up exactly or tell the same causal story, and you can check that through Python. There wasn't any of that at first glance, but as I apply more complicated algorithms, I will have to revisit this.
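One way to run that check, sketched with made-up column names and numbers: compute pairwise correlations and flag pairs near ±1, since those variables "line up" and one of them may need to be dropped or merged.

```python
import pandas as pd

# Toy numeric features; names and values are placeholders, not my real data.
# GPA and SAT here are constructed to line up perfectly, as a worst case.
df = pd.DataFrame({
    "GPA": [3.9, 3.5, 3.7, 3.2],
    "SAT": [1500, 1300, 1400, 1150],
    "Volunteer": [1, 0, 1, 0],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

print(round(corr.loc["GPA", "SAT"], 2))  # 1.0 -> these two are redundant
```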
Then came splitting up the training data and test data. This is what took up most of my fifteen hours, because it is the most important post-data-accumulation task. I have to make sure there are enough data points for every Pac-12 school and enough majors accounted for in both sets. The split should be 75-25 or 80-20, so I'm currently sifting through that.
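The 80-20 split itself can be sketched with scikit-learn. The arrays below are toy stand-ins; the `stratify` argument is what keeps every label proportionally represented in both sets, which is the "enough data points per school" concern above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (20 rows) and binary labels standing in for my dataset.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# 80-20 split; stratify=y preserves the label proportions in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7
)

print(len(X_train), len(X_test))  # 16 4
```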
As for machine learning algorithms, I have been doing some light research on them. My next blog post should be an in depth analysis of that, so watch out for that.
I don’t have any visuals this week that are pertinent, so take this ML joke to put in perspective that my project is not rocket science.
Some quick updates. I have basically finalized my data set for preliminary testing, and I'm looking forward to testing some of the algorithms I've been researching on it. I will go over some ML algorithms that I am currently looking at. Keep in mind all of them are supervised, because I have given them the right answers by including the admissions results.
- Decision trees: Decision trees essentially follow the trail of, well, a decision tree. The slight nuances in every profile will be understood and registered by the algorithm, and it will use them to create a broad "tree" to consult. Take this image as an example:
- Naive Bayes: This is the complicated algorithm I am leaning towards; it's hard to explain but remarkably efficient. Essentially, Naive Bayes constructs classifiers: models that assign labels to problem instances, represented as vectors of feature values, where the labels are drawn from a finite set. Additionally, it's a family of algorithms all based on the same logic: Bayes' theorem. Doesn't make sense? I'll break it down, starting with the abstract. In the abstract, Naive Bayes is a conditional probability model which, when given a problem instance (a concrete utterance of an abstract problem, a.k.a. a real-life problem), assumes that every feature of the problem is independent of all the others. Building off of that, I will be training the model with the data to begin the weighting of every variable.
- Nearest Neighbor: If everything goes wrong, this is what I might have to use. Given a new data point, it searches the dataset for that point's closest companions and predicts from them. Pretty simple to explain, but it still counts as machine learning because it constantly incorporates new data to accurately predict other results.
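All three candidates live in scikit-learn, so trying them side by side is short. This sketch uses the built-in iris dataset purely as a stand-in for my admissions CSV; the scores it prints say nothing about my data, only that the comparison loop works.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset just to demonstrate the mechanics.
X, y = load_iris(return_X_y=True)

# The three algorithm families discussed above.
models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Nearest neighbor": KNeighborsClassifier(n_neighbors=5),
}

# 5-fold cross-validated accuracy for each candidate.
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.2f}")
```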
Those are the algorithms I’m looking at, and I’ll keep you updated on which ones I choose and how the data interacts.
My time this week was spent continuing the data collection I've been doing over the past few weeks, and needless to say, it is tiring me out. I've finished UCLA, UC Berkeley, USC, Arizona, and Arizona State, and am now moving on to Oregon State and Oregon. It has been a painstaking process for three main reasons:
- Inefficient coding- Since COLLEGEData is coded like a site from the early 2000s, I wasn't able to apply any web scrapers to the locked-away data. So, to compensate, I've had to save profiles to my data locker, scour each one to ensure every data value I need is there, and then record it in an Excel spreadsheet.
- Time- As I've said a million times before, I need a lot of data for my project. Having to manually insert it cell after cell takes more time than it should, and it has begun taking a toll on me. This is the part of machine learning that really should be automated, but because of the scarcity of available data, I'm forced to deal with this problem.
- Data Entry- The data entries on COLLEGEData are extremely lacking. People only fill out a certain amount of data because everything is optional, and that's frustrating for someone dealing with these inadequacies. It has caused the work to take three times as long as it should.
Now, I also want to go into some things that stand out to me. I want to preface this by saying that data is king. You can argue with me, but the facts don't lie. There is a key takeaway from the data I've accumulated, and yes, you can argue small sample size, but I'd counter by asking, "What's yours?" Girls and socioeconomically disadvantaged minorities, as it seems from my data (important qualifier), are favored in STEM. Refer to my earlier posts about what data entries and qualities I record, but there have been instances in my data where the only difference between two applicants' profiles is identified gender, yet the female applicants consistently outperform their male counterparts in getting into the top Pac-12 schools. Four of the six instances I can think of off the top of my head occurred with UCLA. Now, I do want to point out some flaws in my own data: these are only the Pac-12 schools, and only six instances, so yes, the sample-size argument is valid. Also, the profiles I am recording are not all-expansive and should not be treated as such, but I ran the data set by my college counselor, and those profiles were considered proficient. This could also lend itself to the argument that what admissions officers look for is fairly qualitative, but I'm not a big believer in that. After all, my project centers around data. I'll do more research into this takeaway, but I thought I should share it.
That’s all for this week, and I’ll update you on what comes next. I’m aiming to be finished with data collection next week, so the fun will start soon!
This week was a continuation of last week's data collection. I now have 200 concentrated and complete data points taken from the COLLEGEData userbase. As of now, most of the data I have been mining is UCLA, USC, and UCB based, as I work through the whole COLLEGEData userbase for every Pac-12 college. It's been going great, and I'm looking forward to this data assimilation being done next week.
In relation to the title, AdmitSee responded to my request weeks after I sent my email, conveying interest in my project. This is a missed opportunity for this project, because now it’s too late to collaborate. However, I’m going to be working with AdmitSee and possibly interning over the summer due to my interest in continuing and refining this project.
Talk to You Next Friday,
What’s up peeps,
This week was a fairly boring week, because I continued working on everyone's favorite task: data collection! There were some uninteresting and fairly irritating developments, but important ones nonetheless.
Task 1: Web Scraping
This was the first development, and it was not a good one by any means. I began using the web scraper I developed, and I ran into a huge problem. The database (I'll get to which database later) would not grant my requests to scrape, which largely reduced the efficiency of my task. At first, I thought the problem was that cookies were not being stored, because of an error I received. So I set up a cookie jar in my code, which solved that error but unearthed a far more fatal one: the server would not grant my request at all. This means that I now have to manually input my test and training data into the Excel spreadsheet.
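For context, a minimal sketch of the cookie-jar fix, assuming the `requests` library: a `Session` keeps its own cookie jar and re-sends stored cookies on every request. The cookie name, value, and domain here are placeholders; this resolves the "cookies not stored" class of error, though in my case the server still refused the scrape itself.

```python
import requests

# A Session carries a persistent cookie jar across requests.
session = requests.Session()

# Placeholder cookie standing in for whatever the site sets on login.
session.cookies.set("sessionid", "abc123", domain="example.com")

# Any request made through `session` now includes the stored cookies.
print(session.cookies.get("sessionid"))  # abc123
```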
Task 2: Database
This leads me to the second task, which was selecting my database. I have decided to move forward with COLLEGEData, mainly because it was my only available choice. Ms. Belcher rejected the BASIS concept, and AdmitSee did not respond. Additionally, the data provided by COLLEGEData allows me to create a pretty comprehensive engine, and the data spans the years 2007-22, so I know it's long-term. The data will take some time to unearth, but it'll be worth it.
Those are my basic updates as the data accumulation phase continues. I hope to have 500 data entries by next week.
This week had some interesting changes in path and some important revelations that are going to be key in the long term of this project. Let’s get into it.
Out of the two deliverables that I had hoped would be done by this week, one was done:
- Web scraping and data collection
Using BeautifulSoup4 and Jupyter Notebook (Python through the Anaconda Navigator module), I created a functioning and efficient web scraping system that returns stock prices on a given site. This program was a starter example, and in the coming weeks I will be using my web scraper to accumulate much of the important data. As for the other deliverable, I did not finish it because my focus shifted. The Python Beginners course had to take a backseat: after talking to my advisor and getting his input, we agreed it wasn't necessary for me to take an extremely broad beginners course when, for this project, I only need some Python skills. So we turned our focus to the elephant in the room: data. For my project, the main problem is a lack of data. From a game theory perspective, colleges have no incentive to publicize admissions data, because then people like me could use that data to find trends and essentially hack the system. They want the admissions process to be shrouded in secrecy, because then no one can ask questions. Unfortunately for them, I'm a Curious George looking for data. Unfortunately for me, they have a monopoly on the data. There are no federal regulations that force them to disclose applicant profiles or anything of the sort. This is a problem for my data-intensive project. So I've realized I must start making some assumptions or narrowing the data pool. Here are some routes that might work:
- Turn this machine learning algorithm into one specifically geared to STEM students who are Asian American and located in the Bay Area. This would be fairly easy data for me to get, because the majority of the people I know fit this description, and I could ask or pay them for the data. However, this is extremely restrictive: it's three very limiting assumptions, so I do not want to do that.
- Make the data set geared to BASIS. This route would require that I accumulate data from the counselors at every BASIS school. I don't know if BASIS would allow this, but I sent emails to Walker and Belcher asking if I could. This would solve the data set problem.
- Use AdmitSee. AdmitSee has the largest and most complete database, but they are a for profit company and their secret sauce is their database. I don’t know if they would be willing to share, but they have thousands of applicant profiles that would complete my engine. I emailed them.
- Use CollegeData. This is the most likely route, but the problem is that their datasets aren’t always complete, so it will take a while to scrape and assimilate all the information.
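Circling back to the finished deliverable above: a minimal offline sketch of what the BeautifulSoup4 starter scraper does. The HTML snippet, tag, and class names here are entirely made up; a real page's markup would differ, and the live page would be fetched over the network first.

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for a downloaded stock-quote page.
html = '<div class="quote"><span class="price">187.42</span></div>'

# Parse the document and pull out the price element by tag and class.
soup = BeautifulSoup(html, "html.parser")
price = soup.find("span", class_="price").text

print(price)  # 187.42
```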
That’s the main goal for next week- choosing a route. It’ll be interesting to see what works out, and I’m excited to update you.
This was a hectic and exciting first week, as the project hit the ground running. Here are the deliverables that were finished this week:
- Finished PluralSight Course: Intro to Machine Learning
- Finished PluralSight Course: Intro to Machine Learning Using Python
Here are the deliverables that were started and hope to be completed by next week:
- Data collection and scraping from the internet- Finding the right conference
- Finishing PluralSight Course: Python for Beginners
In terms of the finished deliverables, a lot was learned and improved upon this week. The first is machine learning itself. It would be a lie if I told you I was an expert on machine learning before this project; it was an interesting subject that I had little context for. The goal of this project is for me to become an expert in the field, and the two courses I finished gave a great deal of helpful context and aid. I made my very first machine learning algorithm, which identified diabetes at a 70% success rate (fairly good, actually, for a basic algorithm), and I also learned some machine learning truths that changed my perspective on my project.
- Clean data is the key. I came into this project thinking I had allocated enough time for data scraping and cleaning (4 weeks), but through the course curriculum, I realized that I might need to spend more. The machine learning algorithms don't have to be constructed from scratch; rather, the right one needs to be applied. The problem is finding the data sets to apply them to. Having the right data that doesn't repeat itself or cause problems is absolutely important.
- A high success rate is impossibly hard. When I say high success rate, I mean 80-90%. The difficulty is in closing those last 20-30 percentage points, and in fact a 70% success rate, considering my time for this project, would be impressive. It's an important truth that I really didn't think much of before, but it's important that I learned it now and adjusted my goals.
This week was a successful start to the CollegeProRec project, and I will continue to progress through some of the deliverables I listed earlier. I'm excited, and I'll keep you updated!