This is a web data dashboard that maps the connections between subreddits on the social media platform Reddit.
Using data from the Reddit API, synthesized by the Community Data Science Collective, this dashboard visualizes attributes of subreddits (e.g. # of distinct posters, % of media posts) and relationships between subreddits (e.g. # of cross-posts).
The dashboard also offers deeper insights into competition and mutualism between subreddits (when one subreddit gains subscribers, whether another subreddit gains or loses subscribers).
The hope for this project is that it will be used by Reddit users to explore new subreddits and by data researchers and enthusiasts to more deeply understand the Reddit ecosystem.
This is a culminating project by students at the University of Washington studying Human-Centered Design & Engineering, sponsored by the Community Data Science Collective in 2022.
Check out our GitHub repo.
Learn more about how to use Reddit Connections with this 2-minute video.
Dashboard Coordinates: Subreddits that are close together have a significant overlap in users (posters and commenters). The coordinates are generated using a UMAP embedding with HDBSCAN clustering.
Topic Clusters: Groups of subreddits that have a significant overlap in users (posters and commenters).
Unique Posters: Number of unique accounts that post to the subreddit.
Unique Commenters: Number of unique accounts that comment in the subreddit.
Average Post Length: Average length of a post (in characters) in a subreddit.
Average Comment Length: Average length of comment (in characters) in a subreddit.
Interactions (comments/post): The number of comments per post in a subreddit.
This indicates, on average, how much interaction posts in the subreddit receive.
NSFW %: Percentage of posts in a subreddit that are tagged as “Not Safe For Work”. This can include content that is “graphic, sexually-explicit, or offensive”.¹
Average Post Score: Average score for each post, where the score is approximately the # of upvotes minus the # of downvotes.
Mean Comment Score: Average score for each comment, where the score is approximately the # of upvotes minus the # of downvotes. This indicates, on average, how upvoted the comments in a subreddit are.
Comment Moderation: The number of deleted comments divided by the total number of comments in a subreddit. This indicates, on average, how heavily moderated a subreddit is.
Post Moderation: The number of deleted posts divided by the total number of posts in a subreddit. This indicates, on average, how heavily moderated a subreddit is.
Media Posts %: The number of media posts divided by the total number of posts in a subreddit.
Text Posts %: The number of text posts divided by the total number of posts in a subreddit.
Links %: The number of link posts divided by the total number of posts in a subreddit.
Cross Posts %: Cross-posts are identical posts that have been posted in multiple subreddits. This measures the number of cross-posts between two subreddits.
Term similarity: Measures the similarity of terms in two subreddits compared to terms across all subreddits. Specifically, TF-IDF vectors are used to fit an embedding model (using SVD) that projects them into a lower-dimensional space. This technique is known as latent semantic analysis.
Author similarity: Measures the similarity of authors in two subreddits compared to authors across all subreddits. Specifically, TF-IDF vectors are used to fit an embedding model (using SVD) that projects them into a lower-dimensional space. This technique is known as latent semantic analysis.
Mutualism & Competition: Mutualism is an ecological interaction where growth in the first group leads to growth in the second. Competition is when growth in the first group leads to decline in the second. Ecological interactions can be mutualistic in one direction and competitive in the other, and mutualism (or competition) from one group to another may or may not be returned in kind.
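For readers curious how the term and author similarity measures above work in principle, here is a minimal sketch of the TF-IDF step followed by a cosine similarity between two subreddits. It is not the dashboard's actual pipeline: the SVD projection into a lower-dimensional space is deliberately left out to keep the example dependency-free, and the function names (`tfidf_vectors`, `cosine`) and the toy counts are made up for illustration. Author similarity would follow the same pattern with author names in place of terms.

```python
import math
from collections import Counter


def tfidf_vectors(counts_by_subreddit):
    """Weight raw counts by TF-IDF.

    counts_by_subreddit maps a subreddit name to a Counter of term -> count.
    Terms that appear in every subreddit get a weight of zero.
    """
    n = len(counts_by_subreddit)
    # Document frequency: the number of subreddits each term appears in.
    df = Counter()
    for counts in counts_by_subreddit.values():
        df.update(counts.keys())
    vectors = {}
    for name, counts in counts_by_subreddit.items():
        total = sum(counts.values())
        vectors[name] = {
            term: (count / total) * math.log(n / df[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy, made-up term counts for three hypothetical subreddits.
toy_counts = {
    "cats": Counter({"cat": 5, "pet": 3, "the": 10}),
    "dogs": Counter({"dog": 5, "pet": 3, "the": 10}),
    "rust": Counter({"borrow": 4, "compile": 4, "the": 10}),
}
vectors = tfidf_vectors(toy_counts)
cats_vs_dogs = cosine(vectors["cats"], vectors["dogs"])  # share "pet"
cats_vs_rust = cosine(vectors["cats"], vectors["rust"])  # share only "the"
```

Because "the" appears in all three subreddits, its TF-IDF weight is zero, so it does not make unrelated subreddits look similar. This is the sense in which similarity is measured "compared to terms across all subreddits".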
¹ The dashboard excludes all subreddits with over 80% NSFW posts.
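The ratio-based metrics defined above, together with the NSFW exclusion in the note, can be sketched as follows. This is an illustrative sketch only: the `SubredditCounts` record, its field names, and the helper functions are hypothetical and are not taken from the dashboard's code or the Reddit API.

```python
from dataclasses import dataclass

# From the note above: subreddits with over 80% NSFW posts are excluded.
NSFW_CUTOFF_PCT = 80.0


@dataclass
class SubredditCounts:
    """Hypothetical per-subreddit raw counts."""
    name: str
    posts: int
    comments: int
    deleted_comments: int
    media_posts: int
    nsfw_posts: int


def ratio(numerator, denominator):
    """Plain ratio; returns 0.0 when there is nothing to divide by."""
    return numerator / denominator if denominator else 0.0


def percent(numerator, denominator):
    """Share of the denominator, expressed as a percentage."""
    return 100.0 * numerator / denominator if denominator else 0.0


def summarize(s):
    """Compute a few of the glossary metrics for one subreddit."""
    return {
        "interactions": ratio(s.comments, s.posts),
        "comment_moderation": ratio(s.deleted_comments, s.comments),
        "media_posts_pct": percent(s.media_posts, s.posts),
        "nsfw_pct": percent(s.nsfw_posts, s.posts),
    }


def visible(subreddits):
    """Keep only subreddits at or below the NSFW cutoff."""
    return [
        s for s in subreddits
        if summarize(s)["nsfw_pct"] <= NSFW_CUTOFF_PCT
    ]
```

A subreddit with 200 posts and 1,000 comments would have an Interactions value of 5 comments per post under this sketch, and a subreddit where 90 of 100 posts are tagged NSFW would be filtered out by `visible`.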