I’ve been writing technical articles on Medium for a year, and have been featured in all major publications including TowardsDataScience, TowardsAI, The Startup, Analytics Vidhya, DataDrivenInvestor, and LevelUpCoding. But one question has always bothered me: “Which day would be best to publish my article?”
To answer the question, I finally decided to perform Exploratory Data Analysis on the articles published each day by the top 6 ML/AI publications on Medium.
The top publications that are crucial to my dataset are:
To answer our question, we need to scrape data from Medium. But we can’t just scrape featured/pinned articles. To get a better picture, we need information about all articles of each publication over a longer period, not just a day or a week, since a single day or week could be an outlier and divert us from the general trend.
Medium offers no option to download a publication’s data the way you can download a user’s data, and it doesn’t provide an API for collecting this information either.
To get the job done, we need a workaround. A publication’s homepage usually shows only its featured/pinned articles.
But our analysis needs every article published on a particular date so that it can be as precise as possible.
If you append “/archive/year/month/day” to a publication’s URL, most of our work is done: the page includes summary information about all the posts published on that day.
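As a minimal sketch of the URL trick described above, the helper below builds the archive address for a given publication and date. The function name `archive_url`, the example publication URL, and the zero-padded month and day are assumptions made for illustration; they are not taken from the scraping code linked in this article.

```python
from datetime import date


def archive_url(publication_url: str, day: date) -> str:
    """Append /archive/year/month/day to a publication's base URL.

    Zero-padding of month and day is an assumption about the URL format.
    """
    base = publication_url.rstrip("/")  # avoid a double slash
    return f"{base}/archive/{day.year}/{day.month:02d}/{day.day:02d}"


# Hypothetical publication URL, used only as an example.
print(archive_url("https://medium.com/example-publication", date(2020, 1, 15)))
```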
To scrape the relevant information, the pipeline I used is as follows:
- Choose top AI/ML publications.
- Choose the duration for scraping; I scraped data for the year 2020 and for 2021 up to 13th May.
- Select ‘n’ random days (for 2020, I used n=30 and, for 2021, I used n=15).
- Scrape data for each publication for each random day.
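The day-sampling step of the pipeline above could look roughly like the sketch below. The helper `sample_days` and the `seed` parameter are hypothetical names introduced here; the date ranges and the values of n come from the list above. It is one plausible way to pick distinct random days, not necessarily how the original script does it.

```python
import random
from datetime import date, timedelta
from typing import List, Optional


def sample_days(start: date, end: date, n: int, seed: Optional[int] = None) -> List[date]:
    """Pick n distinct days uniformly at random from [start, end], sorted."""
    total_days = (end - start).days + 1
    rng = random.Random(seed)  # a seed makes the sample reproducible
    offsets = rng.sample(range(total_days), n)  # sampling without replacement
    return sorted(start + timedelta(days=offset) for offset in offsets)


# 30 days from 2020 and 15 days from 1 Jan to 13 May 2021, as in the pipeline.
days_2020 = sample_days(date(2020, 1, 1), date(2020, 12, 31), n=30, seed=0)
days_2021 = sample_days(date(2021, 1, 1), date(2021, 5, 13), n=15, seed=0)
print(len(days_2020) + len(days_2021))
```

Each sampled day would then be turned into an archive URL for every publication and scraped.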
Following the above pipeline, the information that I extracted for each article is as follows:
- Count of Claps
- Response Count
- Reading Time
- Date of Publishing
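One way to hold these four fields is a small record type plus helpers that turn display strings into numbers. Everything in this sketch is an assumption for illustration: the `ArticleRecord` class, the helper names, and the input formats ("1.2K" for counts, "7 min read" for reading time) are guesses at what scraped text might look like, not the format confirmed by the original scraper.

```python
from dataclasses import dataclass


@dataclass
class ArticleRecord:
    claps: int
    responses: int
    reading_time_min: int
    published: str  # raw date string, cleaned later


def parse_count(text: str) -> int:
    """Convert an assumed display string such as '1.2K' or '87' to an int."""
    text = text.strip().replace(",", "")
    if not text:
        return 0  # treat a missing count as zero
    if text[-1] in "kK":
        return round(float(text[:-1]) * 1000)
    return int(text)


def parse_reading_time(text: str) -> int:
    """Extract the minutes from an assumed string such as '7 min read'."""
    return int(text.strip().split()[0])


record = ArticleRecord(
    claps=parse_count("1.2K"),
    responses=parse_count("3"),
    reading_time_min=parse_reading_time("7 min read"),
    published="2020-01-15",
)
print(record)
```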
In total, the dataset contains around 6000 articles published on 45 randomly sampled days spread over 17 months.
The code to scrape the information can be found at GitHub.
For more info about scraping Medium article information, check out the amazing blog by Dorian Lazar; link in References.
EDA on Collected Data
The data that we collected is not perfect: it doesn’t contain the ‘day of publishing’, only the ‘date of publishing’, and even that is in an incorrect format.
Before we start with EDA, we need to clean the data according to our needs.
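A minimal sketch of the key cleaning step, deriving the day of the week from a date string, is shown below. The raw format of the scraped dates is not stated here, so the default format string `"%Y-%m-%d"` is an assumption, as is the helper name `day_of_week`; the `fmt` argument would need to match whatever the scraper actually produced.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def day_of_week(raw_date: str, fmt: str = "%Y-%m-%d") -> str:
    """Parse a date string and return the English name of its weekday."""
    parsed = datetime.strptime(raw_date, fmt)
    return WEEKDAYS[parsed.weekday()]  # weekday(): Monday is 0, Sunday is 6


print(day_of_week("2021-05-13"))              # last day of the scraped range
print(day_of_week("01/01/2020", "%d/%m/%Y"))  # a different assumed format
```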
Now that the data is cleaned, we can move forward with our analysis.
The Time Duration during which data is collected.
As mentioned earlier, the collected data covers approximately 17 months, from Jan 2020 to May 2021.
Total Articles taken in account
The information shows that data for around 6000 articles was scraped over 17 months.
Number of Articles having Subtitle
Out of around 6000 articles, 57%, i.e. only about 3500 articles, have subtitles.
The publication with maximum articles.
TheStartup and TowardsDataScience are the most active publications with approx. 2k articles, while TowardsAI and MLearning.ai are the least active with around 137 and 40 articles respectively.
Publication with Highest Response
As the earlier trend suggests, TheStartup and TowardsDataScience published the most articles, so they also collect the most responses from visitors.
Surprisingly, though, LevelUpCoding has fewer articles yet records a higher response rate than AnalyticsVidhya. This suggests that the LevelUpCoding community is very supportive and that its content is strong.
Publication with Highest Reading Time
TheStartup publishes the most content in our sampled data, yet TowardsDataScience manages to gain a higher reading time; better content could be a crucial factor here.
Otherwise, the trend remains almost the same: publications that release more articles enjoy a higher reading time.
Response vs Reading Time
The analysis shows that the lower the reading time, the higher the response. It also backs up our previous prediction that TheStartup has a very supportive community and high-quality content.
Image Type Used
Medium supports a lot of image extensions, but generic extensions such as jpg, jpeg, and png remain the most popular amongst users.
Best Day to Publish?
We’re back to the most important question. The analysis shows that most articles are published on Monday, followed by Wednesday. The fewest articles are published on Friday and Saturday.
So, we can conclude that to attract maximum views and responses, authors try to publish their articles on either Monday or Wednesday.
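The per-weekday count behind this conclusion can be sketched with the standard library alone. The helper `articles_per_weekday` is a hypothetical name, and the four dates below are made-up placeholders rather than values from the scraped dataset; they only show the shape of the computation.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def articles_per_weekday(dates: Iterable[date]) -> Dict[str, int]:
    """Count how many publication dates fall on each day of the week."""
    counts = Counter(WEEKDAYS[d.weekday()] for d in dates)
    # Return every weekday in calendar order, including days with no articles.
    return {day: counts.get(day, 0) for day in WEEKDAYS}


# Placeholder dates for illustration only (not real scraped data).
toy_dates = [date(2020, 1, 6), date(2020, 1, 6), date(2020, 1, 8), date(2020, 1, 10)]
per_day = articles_per_weekday(toy_dates)
print(max(per_day, key=per_day.get))
```

Note that this counts how many articles appear on each day. To compare engagement instead, the same grouping could average claps or responses per weekday.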
The full code for the EDA can be found at GitHub.
Using the power of Exploratory Data Analysis, we tried to analyze the Medium data. The scraped data can’t be labeled as 100% accurate, since the sampling is purely random, but it still gives us a rough direction.
Overall, this was a fun project that helped us to understand the scraping of Medium publications and taught us the unlimited exploration power of visualization libraries.
If you like this article, please consider subscribing to my newsletter: Daksh Trehan’s Weekly Newsletter.