Data Science Fun with the OOW Mix Session Voting Data

Over the past few weeks, Oracle Mix opened the Oracle OpenWorld 2011 Suggest-a-Session to the general public, so anyone could submit or vote on a session. One limitation of the Oracle Mix site was that it was impossible to sort the sessions by votes, but that challenge was tackled by Roel Hartman with his blog post and APEX demo. After seeing the top sessions by votes, I found it very interesting that around half of the top 15 sessions were from the same author. That got me thinking…and that thinking turned into a little data hacking project. Now I admit it, I think data is very cool, and even cooler is extracting patterns and neat information from data.

Getting the Data

The Oracle Mix site is very “crawler friendly”: it has well-defined markup and tags, which made extracting the data fairly painless. The basic process I used came down to this:

  1. Get the listing of all the session proposals. That was done by going to the Mix main proposal page and walking all 43 pages of submissions, scraping the direct URL to each session.
  2. Now that I had all of the session abstract URLs, grab each of those pages, all 424 of them.
  3. From each session page, extract the relevant bits: Session Name, Session Author, Total Vote Count, and most importantly, who voted for this session.

I did all of that with curl, wget and some basic regex as a “version 1”, but I was hoping to go back and try it again using some sexier technology like Beautiful Soup. That will have to be continued…
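For anyone who wants to try the same steps in Python, here is a minimal sketch using only the standard library. It is not the curl/wget version described above. The HTML snippets, the `/session_proposals/` URL pattern, and the class names are all hypothetical stand-ins, since the real Mix markup is not reproduced here. It parses inline sample strings rather than fetching anything over the network.

```python
import re
from html.parser import HTMLParser

# Hypothetical listing-page markup; the real Oracle Mix HTML is not shown here.
LISTING_HTML = """
<ul>
  <li><a href="/session_proposals/101-example-session-a">Example Session A</a></li>
  <li><a href="/session_proposals/102-example-session-b">Example Session B</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""

# Hypothetical session-page markup carrying the four fields of interest.
SESSION_HTML = """
<h1 class="title">Example Session A</h1>
<span class="author">Jane Doe</span>
<span class="votes">3 votes</span>
<ul class="voters">
  <li>Voter One</li>
  <li>Voter Two</li>
  <li>Voter Three</li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags that match a URL pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.pattern.search(href):
            self.links.append(href)


def session_links(listing_html):
    """Step 1: pull the session URLs out of one listing page."""
    collector = LinkCollector(r"^/session_proposals/\d+")
    collector.feed(listing_html)
    return collector.links


def parse_session(session_html):
    """Step 3: extract title, author, vote count and voters with basic regex."""
    title = re.search(r'<h1 class="title">(.*?)</h1>', session_html).group(1)
    author = re.search(r'<span class="author">(.*?)</span>', session_html).group(1)
    votes = int(re.search(r'<span class="votes">(\d+)', session_html).group(1))
    voters = re.findall(r"<li>(.*?)</li>", session_html)
    return {"title": title, "author": author, "total_votes": votes, "voters": voters}


links = session_links(LISTING_HTML)
session = parse_session(SESSION_HTML)
print(links)
print(session)
```

Step 2 (fetching each page) is left out on purpose. In a real run, the strings above would come from whatever HTTP client you prefer, and the regexes would need to match the site's actual tags.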

The Social Network Effect

With Oracle Mix Suggest-a-Session, people generally vote for a session for one of two reasons:

  1. They are genuinely interested in the session topic
  2. The session author asked them to vote because of their social relationship

What I think is interesting to know is just how much of the voting is done because of #2. After all, Oracle Mix is a social networking site so there certainly is some voting for that reason. In fact, one of the session authors, Yury Velikanov from Pythian, even blogged his story of rounding up votes. The data shows us this, but more on that in just a bit…

The (Unofficial) Data

I took some time to mingle around the data and found some very interesting things. Let’s just start with a few high level points:

  • There were 424 sessions submitted from 252 different authors.
  • There were 10,125 votes from 2,447 unique voters.
  • The number of submissions ranged from 1 to 24 per author.
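As a quick sanity check on those totals, the averages work out to roughly 24 votes per session, about 4 votes per voter, and a bit under 2 sessions per author. The snippet below just does that arithmetic with the figures from the list above; the variable names are my own.

```python
# Totals taken from the high-level points above.
sessions = 424
authors = 252
votes = 10125
voters = 2447

votes_per_session = votes / sessions
votes_per_voter = votes / voters
sessions_per_author = sessions / authors

print(f"{votes_per_session:.1f} votes per session")
print(f"{votes_per_voter:.2f} votes per voter")
print(f"{sessions_per_author:.2f} sessions per author")
```

These are plain means, so they hide the skew: one author submitted 24 sessions, and a handful of sessions collected well over 100 votes each.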

Here are some interesting tidbits I extracted from the data set (apologies for not making a cool visualization chart of all this – but I’ll make up for it later):

-- top 10 sessions by total votes:
+-------------+-----------------+--------------------------------------------------------------------------------+
| total_votes | session_author  | title                                                                          |
+-------------+-----------------+--------------------------------------------------------------------------------+
|         167 | tariq farooq    | Oracle RAC Interview Q/A Interactive Competition                               |
|         156 | tariq farooq    | Database Performance Tuning: Getting the best out of Oracle Enterprise Manager |
|         137 | tariq farooq    | Overview & Implementation of Clustering & High Availability with Oracle VM     |
|         130 | tariq farooq    | Migrate Your Online Oracle Database to RAC Using Streams and Data Pump         |
|         127 | tariq farooq    | 360 Degrees - Achieving High Availability for Enterprise Manager Grid Control  |
|         126 | yury velikanov  | Oracle11G SCAN: Sharing successful implementation experience                   |
|         124 | sandip nikumbha | Accelerated Interface Development Approach - Integration Framework             |
|         123 | tariq farooq    | Oracle VM: Overview, Architecture, Deployment Planning & Demo/Exercise         |
|         123 | sandip nikumbha | Remote SOA - Siebel Local Web Services Implementation                          |
|         119 | yury velikanov  | AWR Performance data mining                                                    |
+-------------+-----------------+--------------------------------------------------------------------------------+

-- top 10 voters (who placed the most votes)
+--------------------+--------------+
| voter_name         | votes_placed |
+--------------------+--------------+
| arup nanda         |           53 |
| tariq farooq       |           43 |
| connie cservenyak  |           36 |
| xiaohuan xue       |           36 |
| bruce elliott      |           36 |
| peter khoury       |           35 |
| yugant patra       |           35 |
| balamohan manickam |           35 |
| suresh kuna        |           34 |
| eddie awad         |           34 |
+--------------------+--------------+

-- top 10 voters by unique session authors (how many unique authors did they vote for?)
+--------------------+----------------+
| name               | unique_authors |
+--------------------+----------------+
| arup nanda         |             28 |
| paul guerin        |             24 |
| eddie awad         |             24 |
| bruce elliott      |             23 |
| xiaohuan xue       |             23 |
| connie cservenyak  |             23 |
| peter khoury       |             22 |
| wai ling ng        |             22 |
| yugant patra       |             22 |
| balamohan manickam |             22 |
+--------------------+----------------+

-- top 10 session authors by total votes received, number of sessions, avg votes per session
+---------------------+-------------+----------+-----------------------+
| session_author      | total_votes | sessions | avg_votes_per_session |
+---------------------+-------------+----------+-----------------------+
| tariq farooq        |        1057 |        8 |              132.1250 |
| yury velikanov      |         557 |        5 |              111.4000 |
| alex gorbachev      |         429 |        6 |               71.5000 |
| sandip nikumbha     |         360 |        3 |              120.0000 |
| syed jaffar hussain |         281 |        4 |               70.2500 |
| kristina troutman   |         233 |        5 |               46.6000 |
| russell tront       |         221 |        3 |               73.6667 |
| wendy chen          |         217 |        3 |               72.3333 |
| asif momen          |         184 |        2 |               92.0000 |
| alison coombe       |         183 |        5 |               36.6000 |
+---------------------+-------------+----------+-----------------------+
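The four result sets above are simple group-and-count aggregates. Here is a hedged sketch of the same roll-ups in Python, assuming the scraped votes are available as (voter, session title, session author) rows. The sample rows are made up and are not the real data, and this is not the SQL that produced the tables.

```python
from collections import Counter, defaultdict

# Made-up sample rows: (voter, session_title, session_author).
VOTES = [
    ("ann", "Session A1", "alice"),
    ("ann", "Session A2", "alice"),
    ("ann", "Session B1", "bob"),
    ("ben", "Session A1", "alice"),
    ("cat", "Session B1", "bob"),
    ("cat", "Session C1", "carol"),
    ("cat", "Session A1", "alice"),
]

# Sessions by total votes.
votes_per_session = Counter((title, author) for _, title, author in VOTES)

# Voters by number of votes placed.
votes_placed = Counter(voter for voter, _, _ in VOTES)

# Voters by number of distinct authors they voted for.
authors_by_voter = defaultdict(set)
for voter, _, author in VOTES:
    authors_by_voter[voter].add(author)
unique_authors = {voter: len(authors) for voter, authors in authors_by_voter.items()}

# Authors by total votes, session count, and average votes per session.
# Note: sessions with zero votes never appear in VOTES, so they are not counted here.
total_votes = Counter(author for _, _, author in VOTES)
sessions_by_author = defaultdict(set)
for _, title, author in VOTES:
    sessions_by_author[author].add(title)
author_summary = {
    author: (
        total_votes[author],
        len(sessions_by_author[author]),
        total_votes[author] / len(sessions_by_author[author]),
    )
    for author in total_votes
}

print(votes_per_session.most_common(10))
print(votes_placed.most_common(10))
print(sorted(unique_authors.items(), key=lambda kv: -kv[1])[:10])
print(sorted(author_summary.items(), key=lambda kv: -kv[1][0])[:10])
```

The averages in the last table can be checked by hand the same way, for example 1057 / 8 = 132.125 for the first row.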

Diving In Deeper

I could not help noticing that Tariq Farooq had the top 5 spots by total vote count. I would assert that is related to these two points:

  1. Tariq has some very interesting and appealing sessions
  2. Tariq has lots of friends who voted for his sessions

I have no doubt that there is some of both in the mix, but just how much influence on the votes is there from a person’s circle of friends? Or to put it another way: How many voters only voted for a single session author? Or even more interesting, how many people voted for every session for a single author, and voted for no other sessions? All good questions…with answers that reside in the data!

-- number of users who voted for exactly one author
+---------------------------+
| users_voting_for_1_author |
+---------------------------+
|                       828 |
+---------------------------+

-- number of voters who voted for every session by a given author
-- and total # of votes per voter is the same # as sessions by an author
+-------------------------------------------------+
| users_who_voted_for_every_session_of_an_author |
+-------------------------------------------------+
|                                             826 |
+-------------------------------------------------+

Wow – now that is interesting! Of the people who voted for only a single session author, just two did not vote for every one of that author’s sessions. That’s community for you!
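For scale, 828 single-author voters out of 2,447 is about 33.8% of everyone who voted. The two counts above can be sketched in Python as follows, assuming a mapping from session to author and a list of (voter, session) votes. The data and names below are made up for illustration, and this is not the query behind the tables.

```python
from collections import defaultdict

# Made-up data: which author owns each session, and who voted for what.
SESSION_AUTHOR = {"A1": "alice", "A2": "alice", "B1": "bob"}
VOTES = [
    ("v1", "A1"), ("v1", "A2"),  # every alice session, nothing else
    ("v2", "A1"),                # only alice, but not all of her sessions
    ("v3", "A1"), ("v3", "B1"),  # two different authors
    ("v4", "B1"),                # every bob session, nothing else
]

sessions_by_author = defaultdict(set)
for session, author in SESSION_AUTHOR.items():
    sessions_by_author[author].add(session)

sessions_by_voter = defaultdict(set)
for voter, session in VOTES:
    sessions_by_voter[voter].add(session)

single_author_voters = set()
full_slate_voters = set()
for voter, sessions in sessions_by_voter.items():
    authors = {SESSION_AUTHOR[s] for s in sessions}
    if len(authors) != 1:
        continue  # voted for more than one author
    single_author_voters.add(voter)
    (author,) = authors
    # Full slate: the voter's sessions are exactly that author's sessions.
    if sessions == sessions_by_author[author]:
        full_slate_voters.add(voter)

print(sorted(single_author_voters))
print(sorted(full_slate_voters))
print(f"{828 / 2447:.1%} of voters chose a single author")
```

Comparing sets rather than counts guards against a voter being counted as "full slate" just because the number of votes happens to match the number of sessions.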

Visualizing the Voting Graph

I was very interested to see what the Mix Voting Graph looked like, so I imported the voting data into Gephi and rendered a network graph. What I wanted was to identify the community structure of the voting community. Gephi lets you do this by partitioning the graph into modularity classes so that the communities become visible. This process is similar to how the LinkedIn InMap breaks your professional network into different communities.

Here is what the Oracle Mix voting community looks like:
[Image: mix_voting_graph]

This is a great visualization of the communities and it accentuates the data from above – the voters who only voted for a single author. This can be seen in the small nodes on the outer part of the graph that have just a single edge between them and the session author’s node. Good examples of this are for Yury Velikanov and Tariq Farooq. Also clearly visible is what I’d refer to as the “Pythian and friends” community centered around Alex Gorbachev and Yury Velikanov in the darker green color. There are also several other distinct communities and the color coding makes that visible.
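The single-edge nodes on the rim of the graph are simply voters with degree 1. A small sketch of that idea, assuming one edge per distinct voter/author pair and made-up names (neither of which is confirmed to match the graph file used here):

```python
from collections import defaultdict

# Made-up (voter, author) votes; a repeated pair means several sessions by one author.
VOTES = [
    ("v1", "alice"), ("v1", "alice"),
    ("v2", "alice"), ("v2", "bob"),
    ("v3", "bob"),
    ("v4", "alice"),
]

# Assumption: one unweighted edge per distinct (voter, author) pair.
edges = sorted(set(VOTES))

# Degree = number of edges touching a node.
degree = defaultdict(int)
for voter, author in edges:
    degree[voter] += 1
    degree[author] += 1

# Voters connected to exactly one author sit on the outer rim of the layout.
leaf_voters = sorted(voter for voter, _ in edges if degree[voter] == 1)

print(dict(degree))
print(leaf_voters)
```

This toy version keeps voter and author names disjoint. In the real data some people are both, so a node's degree would mix votes cast and votes received.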

Shouts Out

This is my first real data hacking attempt with web data and with tools like Gephi for the graph analysis. One of my inspirations was Neil Kodner’s Hadoop World 2010 Tweet Analysis, so I need to give a big shout out to Neil for that as well as his help with Gephi. Thanks Neil!

And One Last Thing

So what are people’s sessions about that were submitted? This Wordle says quite a bit.

Source

If you wish to play on your own: https://github.com/grahn/oow-vote-hacking

Addendum

Here is another graph where the edges are weighted according to votes to an author (obviously related to number of sessions for that author).
[Image: mix_vote_graph_weighted_edges]
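One way to produce weighted edges like this is to count votes per voter/author pair and write them out as an edge list. The sketch below uses made-up votes and writes to an in-memory buffer. The `Source`, `Target` and `Weight` column names are what Gephi's spreadsheet import looks for, as far as I know; the files in the repository may well use a different format.

```python
import csv
import io
from collections import Counter

# Made-up (voter, author) votes; one row per session voted for.
VOTES = [
    ("v1", "alice"), ("v1", "alice"), ("v1", "alice"),
    ("v2", "alice"),
    ("v2", "bob"),
]

# Edge weight = number of votes a voter gave to an author's sessions.
weights = Counter(VOTES)

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Source", "Target", "Weight"])
for (voter, author), weight in sorted(weights.items()):
    writer.writerow([voter, author, weight])

edge_csv = buffer.getvalue()
print(edge_csv)
```

Because an author with more sessions can receive more votes from the same person, the weight is capped by that author's session count, which is why the heavy edges cluster around the prolific submitters.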

25 comments

  1. Alex Gorbachev

    Fascinating analysis. Really liked Gephi graph… smiled at “Pythian and friends community”.
    Surprised not to see OakTable “cluster”. Besides myself, could barely notice Marco and Frits. I recall I voted for quite a few of them and I saw many of other votes.

    I think that size of the circle is also influenced a lot by number of presentations. Would be interesting the see such chart on per-session basis.

  2. Dominic Delmolino

    So, does that mean that 1/3 of all voters only voted for an author, rather than a set of submissions from multiple authors? And since you must vote 3 times in order for your vote to count, does that mean that each of the voters in this set voted for 3 or more papers from the same author?

    Interesting. I was guessing that the 3 vote floor was designed to encourage diversity of authors and papers, but it appears to be overwhelmed a bit by folks only voting for papers from one author.

  3. Greg Rahn

    The size of the node is influenced by the number of presentations, as an edge represents a vote but is not weighted by the number of votes between nodes – so more presentations means more possible votes from another node.

  4. Kyle Hailey

    Super cool. Thanks for not only the cool analysis but the cool tools and visualizations!
    PS do you have high res versions of the network graphs?
    Maybe next year they can make the criteria, you have to vote for submissions by at least 3 different speakers.
    – Kyle

  5. Greg Rahn

    Of the 2447 voters, 1992 voted for 3 or more sessions while 455 voted for only 1 or 2 sessions – so just over 18% did not make at least 3 votes. By Mix voting rules, I would think these votes may not count.

  6. Greg Rahn

    Thanks, unfortunately I wasn’t successful in getting a hi-res version exported from Gephi, but I did amend the post with a link to the project on github where you can use Gephi on the projects directly and zoom in/out.

  7. Guenadi Jilevski

    It will be interesting to know the breakdown of the voters into attending and non-attending. How the graph will look like?

  8. joel garry

    Echo+ the other comments.

    There is a poster about 20 feet long of this kind of network graph in a math department building at UCLA, depicting the relationship between various academic disciplines.

    I’m kind of surprised at how few votes there are. Hundreds out of what, 30K attendees? But at least the quality of submissions and presenters is high.

    Wordles bug me. They’re kind of like shouting spams crawling all over each other trying to get my attention.

  9. CONNOR MCDONALD

    You could propose on Mix, a session about how to analyze the proposals on Mix, which of course would bias the results….so you’d need to propose on Mix, a session about the proposal on Mix about the analysis of sessions on Mix, which of course would bias the results….so you’d need to propose on Mix, a session about the proposal on Mix about the proposal on Mix about the session on Mix about the analysis of sessions on Mix.

    …you can see where I’m heading with this :-)

  10. Paul Guerin

    I enjoyed the analysis and visualisation tools here…. but did Oracle really expect authors not to use their networks to secure votes?
    I would have thought it’s in the best interest of Open World for authors to use networking to secure votes which in turn promotes this event….. and now we are blogging about it which raises the profile of Open World even more…..

  11. Arup Nanda

    Wow! while being fascinated by the super cool analysis and even cooler visualizations, I couldn’t help but notice my own name at the top of two lists – the number of votes cast by a single member and the votes for unique authors! Not sure if I want to be pleased or embarrassed :) How about a network graph for the voters, grouped into communities of the votes cast – similar to the authors network graph but for voters; not authors?

  12. Tariq Farooq

    This is sick stuff (i mean the cool belieber kindof “SICK”)!!!

    I wish i was as smart as Greg!

    Greg, you need to do an expert webinar @BrainSurface on, how you did all of this: we would ALL love to learn!

    Kudos man!

  13. Kyle Hailey

    It’s great to see that the LinkedIn charting that I’d seen is available for free.
    The controls on Gephi are less than obvious to me, but it’s awesome functionality.
    The PDF output from Gephi produces a nicely zoomable version.
    I’m wondering what peoples groupings are based on in life – companies? geography? friendship?
    Cool stuff

  14. bex

    I think I speak for a lot of people when I say that I feel the Oracle MIX sessions were hijacked this year… A flood of insincere voters pushed a hoard of copycat sessions from a small number of authors, and unfairly shouted down people who have been valuable contributors in the Oracle community for YEARS.

    Frankly, we should have several rules in place to PREVENT this kind of manipulation. Some suggestions here:

    http://bexhuff.com/2011/07/has-oracle-mix-quotsuggestasessionquot-jumped-the-shark

    Hopefully next year we won’t have this same problem…

  15. Noons

    On the other hand, the “valuable contributors … for YEARS” thing is ripe with the same “manipulation” as is being pointed out here. With the small difference that it is done by a closed “club”.
    Ie: if the voting is open to the community, then the community gets blamed for voting. But if it is closed, then said “club” may take advantage of that – as I suspect they have for a long time.

    Quite frankly: if a number of contributors decides to grade themselves as “valuable”, that is entirely their choice. Not the community’s.

    To me, this is nothing more nothing less than another case of “change the flies”…
    Sorry, but it needs to be said.

    Oh, just to be clear: yes, I voted. On various sessions from a number of authors. I hope no one sees that as a problem?

    And while I’m here: isn’t the voting supposed to be secret? Then WTH is it doing being published in the open? Has that been done as well when the voting was closed? Can we see the contents of such past voting as well? Where I come from, it’s called: “what’s good for the goose is good for the gander”…

  16. John Hurley

    There are a whole bunch of weird things going on here obviously. First not many people use Oracle Mix. Trying to get representative voting out of a self selected set of voters is an obvious issue.

    It blows my mind that people can and do apparently vote so many times. That just does not make much sense to me ( both why they do it and why they are apparently allowed to do it ).

    I never heard of Tariq before ( no offense intended ).

    Not sure what Nuno means exactly by “change the flies” ( never heard that before ) but it did make me laugh. Nuno is generally on target when he calls something out.

  17. Pingback: OpenWord Suggest-a-Sessions @ Oracle Mix | The Pythian Blog
  18. Stefanos

    Assuming that most of these consultants (Tariq, Pythian) do this to generate business,
    is this an effective strategy?
    Very few of these presentations convey any deep expertise as they mostly rehash knowledge easily found elsewhere, but the firms behind them position themselves as the top experts.
    Do companies actually hire them based on these presentations?
    I was just wondering if this is worth all the trouble …

  19. Pingback: DB Optimizer » ASH Visualizations: R, ggplot2, Gephi, Jit, HighCharts, Excel ,SVG
  20. Pingback: Kyle Hailey » ASH visualized in R, ggplot2, Gephi, Jit, HighCharts, Excel ,SVG
