Data Science is as Data Science Does
Despite being in the profession for years, it took me some time to get out of the if-then mindset of my day-to-day and begin to analyze how data science could be useful to me (not just the companies I have helped to apply it).
It all started with a pause in my day to take stock of the tasks that were taking up most of my time.
Settling Family Arguments
My foray into practical data science all started with an inordinate amount of time spent arguing about politics with my family and extended Facebook friends. I found myself looking for data comparing extreme right and extreme left politics, to point out that the two are simply not the same.
But my anecdotes and hard-to-find research articles only seemed to fall on deaf ears. So, to save myself time and a lot of frustration, and to arm myself with data, I turned to data science.
I built an ecosystem that collected data from far right and far left organization newsletters. I wrote a web-scraping script that would grab the top 50 search links using phrases like “current trends in right [left] wing media,” extract the text from those sites, and add some high-level meta-data (e.g. was it a .com or a .org site?).
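The post does not show the scraper itself, so here is a minimal sketch of the "extract the text and add some metadata" step only. It assumes the page HTML has already been downloaded; the function name `page_record` and the `leaning` label are hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def page_record(url, html, leaning):
    """Turn one already-fetched page into a row of text plus simple metadata."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    host = urlparse(url).hostname or ""
    return {
        "url": url,
        "leaning": leaning,  # hypothetical label, e.g. "left" or "right"
        "tld": host.rsplit(".", 1)[-1] if "." in host else "",  # "com", "org", ...
        "text": " ".join(parser.chunks),
    }
```

Keeping the fetch separate from the parsing makes the extraction step easy to check against saved pages before pointing it at live search results.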
I then layered on some topic analysis, entity extraction, and sentiment analysis. The final step: package everything up to produce some straightforward graphs showing the differences between the two extreme ideologies.
What was that, Aunt Tootsy? You think the far right uses less extremist language? Wham, here’s a graph showing how much more often right-wingers use the phrase “war on” compared to left-wingers. Problem solved, time recovered.
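A phrase comparison like the “war on” graph boils down to a normalized count. The sketch below is one way to compute it, assuming each side's documents are plain strings; `phrase_rate` is a hypothetical helper, and normalizing per 1,000 words is my assumption so that a larger corpus does not win just by being larger.

```python
import re


def phrase_rate(texts, phrase, per=1000):
    """Occurrences of `phrase` per `per` words across a list of documents."""
    pattern = re.compile(r"\b" + re.escape(phrase.lower()) + r"\b")
    hits = 0
    words = 0
    for text in texts:
        lowered = text.lower()
        hits += len(pattern.findall(lowered))
        words += len(re.findall(r"[a-z']+", lowered))
    return per * hits / words if words else 0.0


# Invented toy sentences, only to show the call shape (not real scraped data).
right_docs = ["The war on facts is a war on reason"]
left_docs = ["Peace talks resume"]
rates = {
    "right": phrase_rate(right_docs, "war on"),
    "left": phrase_rate(left_docs, "war on"),
}
```

The two numbers in `rates` are what a bar chart would plot, one bar per side.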
This solution now does two things: (1) it gives me a quick, data-driven pulse on politics I can share with my argumentative social circle, and (2) it helps to further polarize me from my family 😊
You know what they say. Time saved is money well spent. On to the next!
Grading Student Assignments
Another area that was taking up a significant portion of my time was teaching. I had been teaching as an adjunct for years. Unfortunately for me, there were times when my teaching load became a force multiplier on my time, significantly jeopardizing my ability to get anything else done.
Something had to be done to make my teaching time more efficient. The big time-suck for me? Grading papers. I was spending anywhere from 5–15 minutes per paper even though I had built a comment dictionary to handle 90% of the problems I had seen in past papers.
Despite my organized comment dictionary, I was still finding that it took time to zero in on the right comment. To solve it, I needed a way to recommend comments to myself based on the content of each paper.
I first created a folder of all past papers that I had graded for a specific assignment. Next, I extracted the text from each paper, the comments, and the graded score. I organized all of the information at the paragraph level, creating a simple dataframe with the paper, the paragraph extracted from that paper, the comment, and the score.
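The paragraph-level table described above can be sketched as a list of rows. This assumes paragraphs are separated by blank lines and that comments are keyed by paragraph position; both are assumptions, since the post does not say how comments were attached. `paragraph_rows` is a hypothetical name.

```python
def paragraph_rows(paper_id, text, comments_by_paragraph, score):
    """One row per paragraph: paper, paragraph text, comment, and paper score."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        {
            "paper": paper_id,
            "paragraph": paragraph,
            "comment": comments_by_paragraph.get(index),  # None when no comment was left
            "score": score,
        }
        for index, paragraph in enumerate(paragraphs)
    ]
```

A list of dictionaries like this can be handed straight to `pandas.DataFrame(rows)` if a dataframe is preferred.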
Both the comment and the score were used as targets that I would build models for. The paragraph text was the raw material for the features used to predict either the comment or the score.
To keep things extra simple, I used spaCy’s text categorization pipeline (textcat) to train a model that would recommend a comment given a paragraph. I used a simple regression model for predicting the score, with keywords extracted from the papers as features.
The final solution would consume new papers, break out the paragraphs (any passage over a certain character length), and provide a recommended comment for each paragraph with a confidence score. My model also predicted the score for the paper based on a holistic look at all paragraphs.
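The consume, split, and recommend loop can be sketched without any trained model. The version below is a stand-in, not the spaCy text categorizer: it recommends the comment attached to the most similar previously graded paragraph, using bag-of-words cosine similarity as a rough confidence. The 200-character threshold and all function names are assumptions.

```python
import math
import re
from collections import Counter


def long_paragraphs(text, min_chars=200):
    """Keep only passages longer than a minimum character length."""
    return [p.strip() for p in text.split("\n\n") if len(p.strip()) > min_chars]


def _bag(text):
    """Lowercased word counts for one passage."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two word-count bags (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend_comment(paragraph, graded):
    """Return (comment, similarity) for the most similar graded paragraph.

    `graded` is a list of (paragraph_text, comment) pairs from past papers.
    """
    query = _bag(paragraph)
    best = max(graded, key=lambda pair: _cosine(query, _bag(pair[0])))
    return best[1], _cosine(query, _bag(best[0]))
```

The similarity is not a calibrated probability, so it is best read as a ranking signal: a low value means the paragraph probably needs a fresh comment rather than a recycled one.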
Armed with both tools, I reduced my grading time down to less than 5 minutes per paper. I still review each paper manually, but my model provides quick access to the most likely feedback required for each paragraph.
Coming Up with Creative Content
The first two examples fall under the umbrella of intelligent automation: taking something repetitive, but not easily described as a series of if-then rules, and developing statistical models to help automate it without having to write a million if-then statements.
This last example has more to do with creative inspiration than intelligent automation, though it did help save me time in the end.
A few years ago, my son and I started a t-shirt company. He was really good at drawing, and I was really good at cheaply converting his drawings into digital images. Once they were converted, he could color, enhance, and further refine them to his artistic content.
At first, we were excited, and the creative juices were flowing. T-shirts were flying, robots were being drawn (he has a special niche with robots), and energy was up. But then…sales took a steep slump after all the papas and nanas of the family had purchased their limit. The novelty had worn off and we were suffering from some major creative roadblocks.
Me: Son, our t-shirt production is slumping. You must get back to drawing. The business is relying on you. The fate of your future lies in the balance.
Son: Dad, don’t be so dramatic. Also, I don’t know what else to draw. I’m running out of inspiration for new robots.
That’s when I discovered GANs, or Generative Adversarial Networks as they’re known more formally. A GAN is two neural networks, one generating output from random inputs and another trying to classify the outputs from the generative model (as well as a mess of other training data) as real or generated. The networks work against each other such that the generative model tries to learn how to generate output that fools the classifier into labeling it as real.
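The tug-of-war between the two networks is easiest to see in their loss functions. The sketch below is not a trainable GAN; it only computes the two standard objectives from the classifier's outputs, assuming each value is a probability strictly between 0 and 1 that a sample is real. The generator loss shown is the common "non-saturating" variant.

```python
import math


def discriminator_loss(p_real, p_fake):
    """Binary cross-entropy: real samples should score near 1, generated near 0.

    p_real: classifier probabilities of "real" on real samples
    p_fake: classifier probabilities of "real" on generated samples
    All probabilities must lie strictly between 0 and 1.
    """
    real_term = -sum(math.log(p) for p in p_real) / len(p_real)
    fake_term = -sum(math.log(1.0 - p) for p in p_fake) / len(p_fake)
    return real_term + fake_term


def generator_loss(p_fake):
    """Non-saturating generator loss: low when the classifier is fooled."""
    return -sum(math.log(p) for p in p_fake) / len(p_fake)
```

Notice that `p_fake` appears in both functions with opposite goals: the classifier wants those values near 0, the generator wants them near 1. Training alternates small updates to each network against its own loss, which is a large part of why the process is so sensitive to tuning.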
As I learned about GANs and got inspired by generative image models such as DALL-E from OpenAI (itself not a GAN), I decided to try building a GAN that could draw robots like my son.
Okay, so I don’t want to fluff this last solution by saying I successfully built a GAN to draw like my son in just a few minutes. Quite the opposite: in fact, I am still working on it as of this post.
Yeah, GANs are hard to train. They take a very long time to train, are very sensitive to tuning parameters, and take a very long time to train…wait, did I already say that?…for emphasis then.
At the risk of disappointing you, this one will have to remain a work in progress. But the potential is going to be huge…trust me on this one 😉
As I see it, my GAN will soon be able to generate novel images that we can land on shirts to greatly up our production. Moreover, the outputs can also be used to further inspire my son and serve as a creative crutch when he’s feeling less than inspired.
Content writers are even starting to toy with these generative applications, as OpenAI’s latest GPT-3 model can generate novel content from a few simple prompts.
Controversial, yes, but if used responsibly, these tools may be a useful addition to any creative’s toolkit.
For now, I leave you with the latest output from my generative model. After days of training, my model can draw a line. I promise, it’s a line 😊
Hopefully, your own practicality-juices are now properly flowing. Data science can be useful and there are a number of tools I haven’t covered that can be leveraged to help solve a great many other practical problems for you.
I hope you seek them out, experiment with them, and maybe even find yourself getting some value. At minimum, you will have learned something along the way.
Like engaging to learn more about data science, career growth, or poor business decisions? Join me.