Skewed Data on a Scatterplot
This article continues the discussion of skewed data from the previous article, Skewed Data on a Bar Chart. Skew is more common in numeric data, so you are more likely to deal with it on a scatterplot than on a bar chart. One classic example is the relationship between GDP (Gross Domestic Product, which measures a territory’s income) and population. If we plot GDP against population for each territory, the observations turn out to be heavily skewed to one side, like this:
Unfortunately, most of the countries are clustered in the lower-left corner, so the scatterplot does not clearly show the relationship between GDP and population.
As the previous article suggested, you can take the logarithm of the data. If you are visualizing the data with Python and Plotly, you don’t need to take the logarithm of each data point manually. Instead, set the axis type to 'log' in the layout settings, as below:
import plotly.graph_objects as go

# df is assumed to be a pandas DataFrame holding the columns used below
data = []
data.append(go.Scatter(x=df['Population'],    # No need to take log
                       y=df['Nominal_GDP'],   # No need to take log
                       marker_color=df['color'],
                       text=df['Territory'],
                       hoverinfo='text',
                       mode='markers'))
layout = {'title': {'text': 'Nations\' GDP vs Population', 'x': 0.5},
          'xaxis': {'gridcolor': 'lightgray',
                    'type': 'log'  # Add this parameter to use a log scale on the x-axis
                    },
          'yaxis': {'gridcolor': 'lightgray',
                    'type': 'log'  # Add this parameter to use a log scale on the y-axis
                    },
          'plot_bgcolor': 'rgba(0,0,0,0)'}
fig = go.Figure(data=data, layout=layout)
Once you pass these settings to Plotly, it generates a scatterplot like the one below:
Now the plot is more readable because the observations are no longer clustered, and it shows the upward-sloping relationship between GDP and population more clearly.
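To get a feel for why the log scale spreads the points out, here is a minimal sketch that computes where each value would land along an axis, first on a linear scale and then on a log scale. The territory names and population figures are made up for illustration and are not taken from the dataset used in this article. The helper `axis_positions` is also hypothetical and is not part of Plotly.

```python
import math

# Hypothetical, rounded figures purely for illustration (not the article's dataset)
populations = {
    "Territory A": 1_400_000_000,
    "Territory B": 330_000_000,
    "Territory C": 67_000_000,
    "Territory D": 5_000_000,
    "Territory E": 400_000,
}


def axis_positions(values, log=False):
    """Return each value's relative position (0 to 1) on an axis spanning min to max."""
    scaled = [math.log10(v) if log else float(v) for v in values]
    lo, hi = min(scaled), max(scaled)
    return [(s - lo) / (hi - lo) for s in scaled]


linear = axis_positions(populations.values())
logged = axis_positions(populations.values(), log=True)

# Print the two positions side by side for each territory
for name, lin, lg in zip(populations, linear, logged):
    print(f"{name}: linear={lin:.3f}  log={lg:.3f}")
```

With these made-up numbers, three of the five territories fall within the first tenth of the linear axis, while on the log axis only the smallest one does. That is the same declustering effect you see in the scatterplot above.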
The scripts for generating these scatterplots can be found on my GitHub.