
Search results for ‘Gini’

Inequality, Lorenz-Curves and Gini-Index

In a previous post we looked at inequality of profits and the useful abstraction of the Whale-Curve to analyze Customer Profitability. Here I want to focus on inequality and its measurement and visualization in a broader sense.

A fundamental graphical representation of the form of a distribution is given by the Lorenz-Curve. It plots the cumulative contribution to a quantity over a contributing population. It is often used in economics to depict the inequality of wealth or income distribution in a population.

Lorenz Curve (Source: Wikipedia)

The Lorenz-Curve shows the y% contribution of the bottom x% of the population. The x-axis has the population sorted by increasing contributions (i.e. the poorest on the left and the richest on the right). Hence the Lorenz-Curve is always at or below the diagonal line, which represents perfect equality. (By contrast, the x-axis of the Whale-Curve sorts by decreasing profit contributions.)

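The Gini-Index is defined in terms of two areas in this chart: A, the area between the diagonal and the Lorenz-Curve, and B, the area under the Lorenz-Curve. It can be written as G = A / (A + B), G = 2A or G = 1 – 2B.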

Since each axis is normalized to 100%, A + B = 1/2 and all of the above are equivalent. Perfect equality means G = 0. Maximum inequality G = 1 is achieved if one member of the population contributes everything and everybody else contributes nothing.
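To make the area definition concrete, here is a minimal Python sketch that builds the Lorenz-Curve from a list of contributions and derives the Gini-Index as G = 1 – 2B using the trapezoid rule. The function names are illustrative, and the sketch assumes non-negative values with a positive total.

```python
def lorenz_curve(values):
    """Return the Lorenz-Curve as (population share, cumulative share) points."""
    ordered = sorted(values)  # poorest on the left, richest on the right
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, value in enumerate(ordered, start=1):
        running += value
        points.append((i / n, running / total))
    return points


def gini_from_lorenz(values):
    """Gini-Index via G = 1 - 2B, with B the area under the Lorenz-Curve."""
    points = lorenz_curve(values)
    area_b = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area_b += (x1 - x0) * (y0 + y1) / 2  # trapezoid between two points
    return 1 - 2 * area_b
```

For a finite population of n members where one member contributes everything, this yields 1 – 1/n, which approaches the theoretical maximum of 1 as n grows.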

An interesting interactive graph demonstrating Lorenz-Curves and corresponding Gini-Index values can be found here at the Wolfram Demonstrations Project.

The Gini index is often used to indicate the income or wealth inequality of countries. Typical values are between 0.25 and 0.35 for modern, developed countries and higher for developing countries: around 0.45 – 0.55 in Latin America and up to 0.70 in some African countries with extreme income inequality.

GINI index of world countries in 2009 (Source: Wikipedia)

Graphically, many different shapes of the Lorenz-Curve can lead to the same areas A and B, and hence many different distributions of inequality can lead to the same Gini index.

How can one determine the Gini index? If one has all the data, one can numerically determine the value from the pairwise differences between all members of the population. An example of that is shown here to determine the inequality of market share for 10 trucking companies.

Another approach is to model the actual distribution using a formal statistical distribution with known properties such as Pareto, Log-Normal or Weibull. With a given formal distribution one can often calculate the Gini index analytically. See for example the paper by Michel Lubrano on “The Econometrics of Inequality and Poverty“. In another example, Eric Kemp-Benedict shows in this paper on “Income Distribution and Poverty” how well various statistical distributions match the actually measured data. It is commonly held that at the high end of the income distribution the Pareto distribution is a good model (with its inherent Power law characteristic), while overall the Log-Normal is the best approximation.
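As an illustration of the analytical route, the two distributions named above have well-known closed-form Gini values: 1 / (2α – 1) for a Pareto distribution with shape α > 1, and erf(σ / 2) for a Log-Normal distribution with log-scale standard deviation σ. The sketch below uses these standard textbook results; it is not taken from the cited papers, and the function names are my own.

```python
import math


def gini_pareto(alpha):
    """Gini index of a Pareto distribution with shape parameter alpha (> 1)."""
    if alpha <= 1:
        raise ValueError("the mean is infinite for alpha <= 1")
    return 1 / (2 * alpha - 1)


def gini_lognormal(sigma):
    """Gini index of a Log-Normal distribution with log-scale std dev sigma."""
    return math.erf(sigma / 2)
```

Both functions depend only on the shape parameter, which is why fitting a single parameter to the data is enough to obtain a Gini estimate under these models.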

After studying several of these papers I started to ask myself: If x% of the population contribute y% to the total, what’s the corresponding Gini index? For example, what’s the Gini index for the famous “80-20 rule”, with 20% of the population contributing 80% of the result?

To answer this question I created a simple model of inequality based on a Pareto distribution. Its shape parameter controls the curvature of the distribution, which in turn determines the Gini index. The latter is visualized as color-coded bands using a 2D contour plot in the following graphic:

GINI index contour plot based on Pareto distribution model

The sample data point “A” corresponds to the 80-20 rule, which leads to a Gini index of about 0.75 (strongly unequal distribution). Data point “B” is an example of an extremely unequal distribution, namely US political donations (data from 2010 according to a statistic from the Center for Responsive Politics recently cited by CNNMoney):

“…a relatively small number of Americans do wield an outsized influence when it comes to political donations. Only 0.04% of Americans give in excess of $200 to candidates, parties or political action committees — and those donations account for 64.8% of all contributions”

0.04% contribute 64.8% of the total! Here is another way of describing this: If you had 2500 donors, the top donor would give nearly twice as much as the other 2499 combined. This extreme amount of inequality corresponds to a Gini index of 0.89 (needless to say that this does not seem like a very democratic process…)
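The post does not spell out the formulas behind the contour plot, so the following is only a reconstruction under a stated assumption: a pure Pareto distribution, whose Lorenz-Curve gives the top fraction p of the population a share of p^k, with k = 1 – 1/α. Solving for k and inserting it into the Pareto Gini 1 / (2α – 1) gives G = (1 – k) / (1 + k). The function name is hypothetical.

```python
import math


def pareto_gini_from_top_share(top_fraction, top_share):
    """Gini index of the Pareto distribution in which the richest
    `top_fraction` of the population contributes `top_share` of the total.

    Both arguments are fractions between 0 and 1 (e.g. 0.2 and 0.8).
    """
    if not 0 < top_fraction < top_share < 1:
        raise ValueError("expected 0 < top_fraction < top_share < 1")
    # Pareto model: top_share = top_fraction ** k, with k = 1 - 1/alpha
    k = math.log(top_share) / math.log(top_fraction)
    # Gini = 1 / (2*alpha - 1), rewritten in terms of k
    return (1 - k) / (1 + k)


gini_80_20 = pareto_gini_from_top_share(0.20, 0.80)          # data point "A"
gini_donations = pareto_gini_from_top_share(0.0004, 0.648)   # data point "B"
```

Under this assumption the 80-20 rule comes out at roughly 0.76 and the donation example at roughly 0.89, close to the values read off the contour plot.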

As for US income I created a separate graphic with data points from the high end of the income spectrum (where the underlying Pareto distribution model is a good fit): The top 1% (who earn 18% of all income), top 0.1% (8%), and top 0.01% (3.5%).

GINI Index Contour Plot with high end US Income distribution data points

These 3 data points are taken from Timothy Noah’s “The United States of Inequality“, a 10-part article series on Slate, which in turn is based on data and research from 2008 by Emmanuel Saez and visualizations by Catherine Mulbrandon of VisualizingEconomics.com. This shows that 2008 US income inequality corresponds to a Gini index of approximately 0.46, which is unusually high for a developed country. Income inequality has grown in the US since around 1970, and the above article series analyzes potential factors contributing to that – but that’s a topic for another post. In the spirit of visualizing data to create insight, I’ll just leave you with this link to the corresponding 10-part visual guide to inequality:

Postscript: In April 2012 I came across a nice interactive visualization on the DataBlick website created by Anya A’Hearn using Tableau. It shows the trends of US income inequality over the last 90 years with 7 different categories (Top x% shares) and makes a good showcase for the illustrative power of interactive graphics.


Posted by on September 2, 2011 in Financial, Industrial, Scientific, Socioeconomic

 


World Inequality and the Elephant Curve

In December 2017 the World Inequality Lab (WIL) published its first World Inequality Report 2018. The lab consists of a five-member board and 20+ researchers, mostly from the Paris School of Economics (Thomas Piketty et al.) and the University of California at Berkeley (Emmanuel Saez et al.). Compared to previous work on economic inequality it is fair to say that research has significantly advanced over the last 5 years along several directions:

  • The free report itself is available both online as well as in various download formats and eight languages. It aims to become a data-driven foundation for societal and policy discussions about inequality.
  • All underlying data are openly published (via the World Wealth & Income Database WID) to support reproducibility and stimulate further research.
  • The methodology to aggregate data encompasses more sources, more attributes (including age, gender, etc.) and better informed estimates, across a wider spectrum of countries and geographies (all important for policy discussions).
  • The visualizations have evolved beyond limited measures such as the Gini-Index and now typically include interactive charts (for example at http://wid.world/country/usa/).

This report is quite detailed and holistic. Aside from the Executive Summary, Introduction, Conclusion and Appendices, it consists of the following five parts:

  1. AIM OF THE WORLD INEQUALITY REPORT 2018
  2. NEW FINDINGS ON GLOBAL INCOME INEQUALITY
  3. EVOLUTION OF PRIVATE AND PUBLIC CAPITAL OWNERSHIP
  4. NEW FINDINGS ON GLOBAL WEALTH INEQUALITY
  5. FUTURE OF GLOBAL INEQUALITY AND HOW IT SHOULD BE TACKLED

There are many interesting findings. Let me just provide three examples in this Blog, together with respective visualizations telling the “story in the data”.

Example 1: Inequality rising everywhere, but at different speeds

Here is Figure E2a, showing the Top 10% income shares across several large geographies over the period 1980-2016:

figure-e2a

From the report’s Executive Summary:

  • Since 1980, income inequality has increased rapidly in North America, China, India, and Russia. Inequality has grown moderately in Europe (Figure E2a). From a broad historical perspective, this increase in inequality marks the end of a postwar egalitarian regime which took different forms in these regions.

and further

  • The diversity of trends observed across countries since 1980 shows that income inequality dynamics are shaped by a variety of national, institutional and political contexts.

  • This is illustrated by the different trajectories followed by the former communist or highly regulated countries, China, India, and Russia (Figure E2a and b). The rise in inequality was particularly abrupt in Russia, moderate in China, and relatively gradual in India, reflecting different types of deregulation and opening-up policies pursued over the past decades in these countries.

  • The divergence in inequality levels has been particularly extreme between Western Europe and the United States, which had similar levels of inequality in 1980 but today are in radically different situations. While the top 1% income share was close to 10% in both regions in 1980, it rose only slightly to 12% in 2016 in Western Europe while it shot up to 20% in the United States. Meanwhile, in the United States, the bottom 50% income share decreased from more than 20% in 1980 to 13% in 2016 (Figure E3).

The latter is apparent from the supporting visualization in Figure E3, contrasting the Top 1% and Bottom 50% national income shares in the US with that of Western Europe:

figure-e3

figure-e3b

Although the y-axis does not start at 0% and differs in scale between the two charts, the underlying story, i.e. the evolution of income shares of the rich (top 1%) and lower class (bottom 50%) over the last 35 years, is apparent:

  • Income shares have changed significantly in the US:
    • The Top 1% nearly doubled their income share from 11% to 20%
    • The Bottom 50% saw their income share almost cut in half from 21% to 13%
  • Income shares have been fairly stable in Western Europe

 

Example 2: The elephant curve of global inequality

On this Blog we have written a lot about the Gini index (see Gini posts). One of the limitations of the Gini index is that it reduces the entire inequality picture down to a single scalar value. Multiple distributions result in the same Gini index, which means that structural distribution changes may be masked by a nearly constant Gini index.

For example, world inequality over the last 35 years has had both increasing effects (such as growth concentration at the top) as well as decreasing effects (raising hundreds of millions of people out of poverty in India and China). Visualizing the Gini index over time does not show this dynamic well.

Another chart to visualize this dynamic more clearly is the elephant curve – named after the shape of the animal. This curve lists all population groups in percentiles along the x-axis, sorted by increasing income from left to right. The first 99 % have the same x-axis spacing; the top 1% on the right is split into 10 subgroups of 0.1% each; the top 0.1% is again split into 10 subgroups of 0.01%, and finally the top 0.01% is again split into 10 subgroups of 0.001%. This gives a finer resolution near the top of the income distribution, highlighting the very disproportionate accrual of growth at the top. See Figure E4 for global inequality growth from 1980 – 2016:
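The x-axis grouping described above can be sketched in a few lines of Python. The function name and the (lower, upper) tuple representation are illustrative choices of mine, not something taken from the report.

```python
def elephant_curve_groups():
    """Return (lower, upper) percentile bounds for each x-axis group."""
    # The first 99 groups are whole percentiles: 0-1, 1-2, ..., 98-99
    groups = [(float(p), float(p + 1)) for p in range(99)]
    lower, width = 99.0, 0.1
    for level in range(3):
        # At each level the topmost tenth is subdivided again,
        # except at the last level, where all ten subgroups are kept.
        count = 10 if level == 2 else 9
        for i in range(count):
            groups.append((lower + i * width, lower + (i + 1) * width))
        lower += 9 * width
        width /= 10
    return groups


groups = elephant_curve_groups()
```

This yields 99 + 9 + 9 + 10 = 127 groups, each drawn with equal spacing on the x-axis even though the groups near the top cover ever smaller slices of the population.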

figure-e4

The big bump on the left (head of the elephant) represents the large number of people lifted out of poverty (mostly in India and China). The steep rise on the right (trunk of the elephant) represents the disproportionate gains at the top of the economic income distribution. Again, from the Executive Summary:

How has inequality evolved in recent decades among global citizens? We provide the first estimates of how the growth in global income since 1980 has been distributed across the totality of the world population. The global top 1% earners has captured twice as much of that growth as the 50% poorest individuals. The bottom 50% has nevertheless enjoyed important growth rates. The global middle class (which contains all of the poorest 90% income groups in the EU and the United States) has been squeezed.

To underscore the last statement, here is the elephant curve of income growth from 1980-2016 for just the US-Canada and Western Europe (Figure 2.1.2):

figure-212

Note how in this chart, without China and India, the left side is flat, indicating that the lower economic classes have only had average or negligible income growth.

How did this translate into shares of growth captured by different groups? The top 1% of earners captured 28% of total growth—that is, as much growth as the bottom 81% of the population. The bottom 50% earners captured 9% of growth, which is less than the top 0.1%, which captured 14% of total growth over the 1980–2016 period. These values, however, hide large differences in the inequality trajectories followed by Europe and North America. In the former, the top 1% captured as much growth as the bottom 51% of the population, whereas in the latter, the top 1% captured as much growth as the bottom 88% of the population. (See chapter 2.3 for more details.)

It is noteworthy that the closer to the top, the higher the cumulative income growth, especially in the US. For example, Table 2.4.2 below shows that since 1980, US income has more than

  • doubled for the Top 10% (growth = 121%)
  • tripled for the Top 1% (204%)
  • quadrupled for the Top 0.1% (320%)
  • quintupled for the Top 0.01% (453%) and
  • septupled for the Top 0.001% (636%)
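As a quick check of the wording in the list above: a cumulative growth of g% means the final income is (100 + g) / 100 times the starting income, so 121% growth is a factor of 2.21, i.e. "more than doubled". A small sketch using the growth figures quoted above (the variable and function names are mine):

```python
# Cumulative US income growth since 1980 in percent, as quoted above
growth_percent = {
    "Top 10%": 121,
    "Top 1%": 204,
    "Top 0.1%": 320,
    "Top 0.01%": 453,
    "Top 0.001%": 636,
}


def income_multiple(growth):
    """Final income as a multiple of the starting income."""
    return (100 + growth) / 100


multiples = {group: income_multiple(g) for group, g in growth_percent.items()}
```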

 

table-242

Another interesting finding from this is that pre-tax US income for the bottom 50% has essentially remained unchanged (growth = 1%) for an entire generation, with the bottom 20% even seeing their income shrink by 25%. Economic policies which exclude large portions of the population from growth for an entire generation are bound to increase tensions within that population, here primarily along the lines of economic class boundaries.

Example 3: Geographic breakdown of global income groups

In Part 2 the report looks at the share of Africans, Asians, Americans and Europeans in each of the global income groups and how this has changed over the last few decades. To illustrate, there are two snapshots in time, first at 1990 (Figure 2.1.5)

figure-215

and then at 2016 (Figure 2.1.6):

figure-216

Comparing these two area charts reveals a few interesting developments at the level of entire geographic regions:

In 1990, Asians were almost not represented within top global income groups. Indeed, the bulk of the population of India and China are found in the bottom half of the income distribution. At the other end of the global income ladder, US-Canada is the largest contributor to global top-income earners. Europe is largely represented in the upper half of the global distribution, but less so among the very top groups. The Middle East and Latin American elites are disproportionately represented among the very top global groups, as they both make up about 20% each of the population of the top 0.001% earners. It should be noted that this overrepresentation only holds within the top 1% global earners: in the next richest 1% group (percentile group p98p99), their share falls to 9% and 4%, respectively. This indeed reflects the extreme level of inequality of these regions, as discussed in chapters 2.10 and 2.11. Interestingly, Russia is concentrated between percentile 70 and percentile 90, and Russians did not make it into the very top groups. In 1990, the Soviet system compressed income distribution in Russia.

In 2016, the situation is notably different. The most striking evolution is perhaps the spread of Chinese income earners, which are now located throughout the entire global distribution. India remains largely represented at the bottom with only very few Indians among the top global earners.

The position of Russian earners was also stretched throughout from the poorest to the richest income groups. This illustrates the impact of the end of communism on the spread of Russian incomes. Africans, who were present throughout the first half of the distribution, are now even more concentrated in the bottom quarter, due to relatively low growth as compared to Asian countries. At the top of the distribution, while the shares of both North America and Europe decreased (leaving room for their Asian counterparts), the share of Europeans was reduced much more. This is because most large European countries followed a more equitable growth trajectory over the past decades than the United States and other countries, as will be discussed in chapter 2.3.

There are, of course, many more findings in this report. It is great to see that such rigorous data-driven analysis is made available free of charge and easy to consume (desktop, iPad, etc.). One can hope that such foundational work will lead to a more educated civic discussion about the current status of economic inequality, the impact of various policy tools as well as the geographic developments on these inequalities.

 

Posted by on February 16, 2018 in Socioeconomic

 


2012 Election Result Maps


The New York Times has covered the 2012 U.S. presidential election in great detail, including the much heralded fivethirtyeight Blog (after the 538 electoral votes) by forecaster Nate Silver. His poll-aggregation model has consistently produced the most accurate forecasts, and called 99 of 100 state outcomes correctly across the 2008 and 2012 elections combined (49 of 50 states in 2008 and all 50 in 2012).

A popular visualization is the map of the 50 states in colors red (Republican) and blue (Democrat) plus green (Independent). Since most states allocate all their electoral votes to the candidate with the most votes in that state, this state map seems the most important.

2012 Election Result By State (Source: NYTimes.com)

This map hardly changed from 2008; only Indiana and North Carolina changed color. Hence the electoral vote result in 2012 (332 Dem, 206 Rep) is similar to that of 2008 (365 Dem, 173 Rep). The visual perception of this map, however, is that there is roughly the same amount of red and blue, with slightly more red than blue. This perception becomes even stronger when looking at the results by county.

2012 Election Results By County (Source: NYTimes.com)

Why is the outcome so strongly in favor of the blue (Democrat) when it looks like the majority of the area is red? The answer is found in the very uneven population density of the 50 states. Although the two states are roughly the same size, California (slightly more blue) has a population density about 40x higher than that of Montana (mostly red). At the extreme end of this scale, the most densely populated state, New Jersey, has about 1000x as many people living per square mile as the least densely populated state, Alaska. Urban areas have a much higher density of voters than rural areas. The different demographics are such that urban areas tend to vote more blue (Democrat), while rural areas tend to vote more red (Republican). The size of the colored area in the above chart would only be a good indicator if the population density were uniform. A great way to compensate visually for this difference can be seen in the third chart published by the NYTimes.

2012 Election Delta By County (Source: NYTimes.com)

Now the size of the colored circles is proportional to the number of surplus votes for that color in that county. The few blue circles around most major cities are larger and outweigh the many small red circles in rural areas, both visually and in the numerical total. The original map is interactive, giving tooltips when you hover over the circles. For example, in Los Angeles county alone there were about 1 million more blue (Democrat) votes than red (Republican).

2012 Election in Los Angeles County

This optical summation leads to intuitively correct results for the popular votes. The difference in popular vote was about 3.5 million more blue (Democrat) votes or roughly 3%. We see more blue in this delta circle diagram.

Of course, the president is not elected by the popular vote, but by the electoral votes per state. So no matter how big the Democrat advantage in California may be, there won’t be more than the 55 electoral votes for California. This winner-take-all dynamic of electoral votes by state leads to the outsized influence of swing states which are near the 50%-50% mark on the popular votes. A small lead in the popular vote can lead to a large gain in electoral votes. In extreme cases, a candidate can win the electoral vote and become president despite losing the popular vote (as happened in 2000 with the very narrow win of Florida by George W. Bush).
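The winner-take-all effect is easy to reproduce with a toy example. The three states and all vote counts below are entirely hypothetical and chosen only to show how narrow wins in some states can outweigh a landslide loss elsewhere; ties are not handled in this sketch.

```python
def tally(states):
    """Sum electoral and popular votes for candidates A and B.

    `states` is a list of (electoral_votes, votes_a, votes_b) tuples.
    Each state's electoral votes go entirely to its popular-vote winner.
    """
    electoral = {"A": 0, "B": 0}
    popular = {"A": 0, "B": 0}
    for electoral_votes, votes_a, votes_b in states:
        popular["A"] += votes_a
        popular["B"] += votes_b
        winner = "A" if votes_a > votes_b else "B"
        electoral[winner] += electoral_votes
    return electoral, popular


# Hypothetical states: (electoral votes, votes for A, votes for B)
toy_states = [
    (10, 510_000, 490_000),  # A wins narrowly
    (10, 505_000, 495_000),  # A wins narrowly
    (10, 300_000, 700_000),  # B wins by a landslide
]
electoral, popular = tally(toy_states)
```

In this toy setup candidate A collects 20 of the 30 electoral votes while receiving fewer popular votes in total than candidate B.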

Another variation on this theme of visually combining votes and population density information comes from Chris Howard. (This was referenced in an article on theatlanticcities.com by Emily Badger on the spatial divide of urban vs. rural voting preferences which has other election maps as well). The idea is to use shades of blue and red on a county-level map, with darker shades indicating higher population density.

2012 Election by county with shading by population density (Source: Chris Howard)

A final visualization comes from Nate Silver’s Blog post on November 8. While the percentage details of this result, which was still preliminary at the time, may be slightly off (not all votes had been counted yet), the electoral vote counts remain valid.

2012 Election By State Cumulative (Source: Fivethirtyeight Blog)

It shows which swing state [electoral votes] put the blue ticket over the winning line (Colorado [9]) and which other swing states could have been lost without losing the presidency (Florida [29], Ohio [18], Virginia [13]). It also gives a crude, but somewhat telling indication of where you might want to live if you want to surround yourself with people with blue or red preferences.

 

Posted by on November 15, 2012 in Socioeconomic

 

Trends in Health Habits across the United States


This week Scientific American published an interesting article about trends in health habits across the United States. The article includes both a large composite chart as well as a page with an interactive chart. Both are well done and a great example of using a visualization to help tell a story. I personally find the most useful part of the graphic to be the comparison column on the right with shades of color indicating degree of improvement (blue) or deterioration (red).

US health habits 1995 vs. 2010 (Source: Scientific American)

From the article:

Americans are imbibing alcohol and overeating more yet are smoking less (black lines in center graphs).

Some of the behaviors have patterns; others do not. Obesity is heaviest in the Southeast (2010 maps). Smoking is concentrated there as well. Excess drinking is high in the Northeast.

Comparing 2010 and 1995 figures provides the greatest insight into trends (maps, far right). Heavy drinking has worsened in 47 states, and obesity has expanded in every state. Tobacco use has declined in all states except Oklahoma and West Virginia. The “good” habit, exercise, is up in many places—even in the Southeast, where it has lagged.

A more detailed visual analysis is possible using the interactive version of these graphs on the related subpage Bad Health Habits are on the rise. Here one can compare up to three arbitrary states against top, median, and bottom performing states by health habit.

The following examples show tobacco use, exercise and obesity by state with line charts for the three arbitrarily selected states of Florida, California and Hawaii.

Tobacco Trend By State

Exercise Trend By State

Obesity Trend By State

Leading the exercise statistics are citizens in states offering attractive outdoor sports opportunities, like Oregon or Hawaii. Such correlation seems intuitive in both causal directions: People interested in exercise tend to move to those states with the most attractive outdoor sports. And people living in those states may end up exercising more due to the opportunity.

When looking at the average trend line, exercise seems to have leveled off after a bump in the early 2000s, whereas the decline in smoking over the last decade continues unabated.

15 years is half a generation. During that time, Americans have smoked less in almost every state and exercised more in many states, but obesity is sharply on the rise in every state! From a health and policy perspective the latter seems to be the most alarming trend. Most people want the next generation to be better off than the previous one. This has to some extent been true with wealth, at least until the great recession of 2008. But these data show that at population levels, more wealth is not necessarily more health.

 

Posted by on October 19, 2012 in Medical

 

Inequality and the World Economy


The last edition of The Economist featured a 25-page special report on “The new politics of capitalism and inequality” headlined “True Progressivism“. It is the most recommended and commented story on The Economist this week.

We have looked at various forms of economic inequality on this Blog before, as well as other manifestations (market share, capitalization, online attention) and various ways to measure and visualize inequality (Gini-index). Hence I was curious about any new trends and perhaps ways to visualize global economic inequality. That said, I don’t intend to enter the socio-political debate about the virtues of inequality and (re-)distribution policies.

In the segment titled “For richer, for poorer” The Economist explains:

The level of inequality differs widely around the world. Emerging economies are more unequal than rich ones. Scandinavian countries have the smallest income disparities, with a Gini coefficient for disposable income of around 0.25. At the other end of the spectrum the world’s most unequal, such as South Africa, register Ginis of around 0.6.

Many studies have found that economic inequality has been rising over the last 30 years in many industrial and developing nations around the world. One interesting phenomenon is that while the Gini index of many countries has increased, the Gini index of world inequality has fallen. This is shown in the following image from The Economist.

Global and national inequality levels (Source: The Economist)

This is somewhat non-intuitive. Of course the countries differ widely in terms of population size and level of economic development. At a minimum it means that a measure like the Gini index is not simply additive when aggregated over a collection of countries.
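A toy calculation shows how this can happen. The incomes below are entirely hypothetical: a poor country grows strongly but unequally, a rich country keeps the same average income and becomes slightly more unequal. The Gini function uses the mean absolute difference over all pairs, which is one standard way to compute it from raw data.

```python
def gini(values):
    """Gini index from the mean absolute difference over all pairs."""
    n = len(values)
    mean = sum(values) / n
    total_diff = sum(abs(a - b) for a in values for b in values)
    return total_diff / (2 * n * n * mean)


# Hypothetical incomes of four people in each of two countries
before_x, before_y = [1, 1, 1, 1], [10, 10, 10, 10]
after_x, after_y = [2, 4, 6, 8], [8, 10, 10, 12]

within_before = (gini(before_x), gini(before_y))
within_after = (gini(after_x), gini(after_y))
pooled_before = gini(before_x + before_y)
pooled_after = gini(after_x + after_y)
```

Here the Gini rises within both countries (from 0 to 0.25 and from 0 to 0.075), while the Gini of the pooled population falls from about 0.41 to about 0.23, because the gap between the two countries shrinks.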

Another interesting chart displays a world map with color coding the changes in inequality of the respective country.

Changes in economic inequality over the last 30 years (Source: The Economist)

It’s a bit difficult to read this map without proper knowledge of the absolute levels of inequality, such as we displayed in the post on Inequality, Lorenz-Curves and Gini-Index. For example, a look at a country like Namibia in southern Africa indicates a trend (light-blue) towards less inequality. However, Namibia was for many years the country with the world’s largest Gini (1994: 0.7; 2004: 0.63; 2010: 0.58 according to iNamibia) and hence still has much larger inequality than most developed countries.

World Map of national Gini values (Source: Wikipedia)

So global Gini is declining, while in many large industrial countries Gini is rising. One region where regional Gini is declining as well is Latin America. Between 1980 and 2000 Latin America’s Gini grew, but in the last decade it has declined back to 1980 levels (~0.5), despite the strong economic growth throughout the region (Mexico, Brazil).

Gini of Latin America over the last 30 years (Source: The Economist)

Much of the coverage in The Economist tackles the policy debate and the questions of distribution vs. dynamism. On the one hand reducing Gini from very large inequality contributes to social stability and welfare. On the other hand, further reducing already low Gini diminishes incentives and thus potentially slows down economic growth.

In theory, inequality has an ambiguous relationship with prosperity. It can boost growth, because richer folk save and invest more and because people work harder in response to incentives. But big income gaps can also be inefficient, because they can bar talented poor people from access to education or feed resentment that results in growth-destroying populist policies.

In other words: Some inequality is desirable, too much of it is problematic. After growing over the last 30 years, economic inequality in the United States has perhaps reached a worrisome level as the pendulum has swung too far. How to find the optimal amount of inequality and how to get there seem like fascinating policy debates to have. Certainly an example where data visualization can help an otherwise dry subject.

 

Posted by on October 15, 2012 in Socioeconomic

 


Olympic Medal Charts


The 2012 London Olympic Games ended this weekend with a colorful closing ceremony. Media coverage was unprecedented, with other forms of competition around who had the most social media presence or which website had the best online coverage of the games.

In this post I’m looking at the medal counts over the history of the Olympic Games (summer games only, 27 Games over the last 116 years, no games in 1916, 1940, and 1944). In London, nearly 11,000 athletes from 205 countries competed for more than 900 medals in 302 events. The New York Times has an interactive chart of the medal counts on their London 2012 Results page:

Bubble size represents the number of medals won by the country, bubble position is roughly based on a world map and bubble color indicates the continent. Moving the slider to a different year changes the bubbles, which gives a dynamic grow or shrink effect.

Below this chart is a table listing all gold, silver, bronze winners for each sport in that year, grouped by type of sport such as Gymnastics, Rowing or Swimming. Selecting a bubble will filter this to entries where the respective country won a medal. This shows the domination of some sports by certain countries, such as Diving (8 events, China won 6 gold and 10 total medals) or Cycling – Track (10 events, Great Britain won 7 gold and 9 total medals). In two sports, domination by one country was 100%: Badminton (5 events, China won 5 gold and 8 total medals), Table Tennis (4 events, China won 4 gold and 6 total medals).

There is also a summary table ranking the countries by total medals. For 2012, the United States clearly won that competition, winning more gold medals (46) than all but 3 other countries (China, Russia, Britain) won total medals.

Top 10 countries for medal count in 2012

Of course countries vary greatly by population size. It is remarkable that a relatively small nation such as Jamaica (~2.7 million) won 12 medals (4, 4, 4), while India (~1.25 billion) won only 6 medals (0, 2, 4). In that sense, Jamaica is about 1000x more medal-decorated per population size than India! In another New York Times graphic there is an option to compare medal count adjusted for population size, i.e. with the medal count normalized to a standard population size of say 100 million.
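The normalization is simple arithmetic. Here is a minimal sketch using the approximate figures quoted above; the function name is mine, and the 100 million reference population is just the example value mentioned in the text.

```python
def medals_per_100_million(medals, population):
    """Medal count scaled to a standard population of 100 million."""
    return medals / population * 100_000_000


jamaica = medals_per_100_million(12, 2_700_000)     # ~2.7 million people
india = medals_per_100_million(6, 1_250_000_000)    # ~1.25 billion people
ratio = jamaica / india
```

With these rounded inputs the ratio comes out a little over 900, which is the order of magnitude behind the rough "1000x" figure.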

Directed graph comparing medal performance adjusted for country size

Selecting any node in this graph will highlight countries with better, worse or comparable relative medal performance. (There are different ways to rank based on how different medals are weighted.)

The Guardian Data Blog has taken this a step further and written a piece called “alternative medals table“. This post not only discusses multiple factors like population, GDP, or number of athletes and how to deal with them statistically; it also provides all the data and many charts in a Google Docs spreadsheet. One article combines GDP adjustment with cartographical mapping across Europe:

Medals GDP Adjusted and mapped for Europe

If you want to do your own analysis, you can get the data in shared spreadsheets. To do a somewhat more historic analysis, I used a different source, namely Wolfram’s curated data source accessible from within Mathematica. Of course, once you have all that data, you can examine it in many different directions. Did you know that 14,853 Olympic medals have been awarded so far in 27 summer Olympiads? The average was 550 medals per Games, growing by about 29 medals from one Games to the next, with nearly 1000 awarded in 2008 and 2012.

A lot of attention was paid to who would win the most medals in London. China seemed in contention for the top spot, but in the end the United States won the most medals, as it did in the last 5 Olympiads. Only 7 countries have ever won the most medals at an Olympiad. Greece (1896), France (1900), the United Kingdom (1908), Sweden (1912), and Germany (1936) did so just once. The Soviet Union (which no longer exists) did it 8 times. And the United States did it 14 times. China, which has only been participating since 1984, has yet to win the most medals at any Olympiad.

Aside from the top rank, I was curious about the distribution of medals over all countries. Both nations and events have increased, as is shown in the following paired bar chart:

Number of participating nations and total medals per Summer Games

The number of nations grew steadily with only two exceptions during the thirties and the seventies, presumably because economic hardship kept many nations from affording participation. 1980 also saw the boycott of the Moscow Games by the United States and several other delegations over geopolitical disagreements. At just over 200, the number of nations seems to have stabilized.

The number of medals depends primarily on the number of events at each Olympiad. This year there were 302 events in 26 types of Sports. Total medal count isn’t necessarily exactly triple that since in some events there could be more than 1 Bronze (such as in Judo, Taekwondo, and Wrestling). Case in point, in 2012 there were 968 medals awarded, 62 more than 3 * 302 events.

What is the distribution of those medals over the participating nations? One measure would be the percentage of nations winning at least some medals. Another measure showing the degree of inequality in a distribution is the Gini index. Here I plotted the percentage of nations medaling and the Gini index of the medal distribution over all participating nations for every Olympiad:

Percentage and Gini-Index of medal distribution by nations

Up until 1932, 3 out of 4 nations won at least some medals. Then the percentage dropped to around 40% and lower since the sixties. That means 6 of 10 nations go home without any medals. During the same time period the inequality grew from a Gini of about 0.65 to nearly 0.90. One exception was the Third Games in 1904 in St. Louis: with only 13 nations competing, the United States dominated so many sports that the result was an extreme Gini of 0.92. All of the last five Games resulted in a Gini of about 0.86, so this still very large amount of medal-winning inequality seems to have stabilized.
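Both measures used above are easy to compute from a list of medal counts per nation. The following Python sketch is not the Mathematica code behind the chart; it uses a standard closed-form Gini formula on sorted values, and the medal counts are invented purely for illustration.

```python
def gini(values):
    """Gini index of non-negative values (0 = perfect equality)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def share_medaling(values):
    """Fraction of nations that won at least one medal."""
    return sum(1 for v in values if v > 0) / len(values)


# Hypothetical medal counts for 10 participating nations (not real data).
medals = [0, 0, 0, 0, 0, 0, 1, 2, 5, 30]

print(f"Nations medaling: {share_medaling(medals):.0%}")
print(f"Gini index: {gini(medals):.2f}")
```

Note that nations without any medals are part of the population here; leaving them out would lower the Gini value considerably.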

It would be interesting to extend this to the level of participating athletes. Of course we know which athlete ranks at the top as the most decorated Olympic athlete of all time: Michael Phelps with 22 medals.

 

Posted by on August 15, 2012 in Recreational

 


Visualign Blog – View Stats for first year and a half

I started this Data Visualization Blog back at the end of May 2011. WordPress provides decent analytics to measure things like views, referrer, clicks, etc. The built-in stats show bar charts by day/week/month, views by country, top posts and pages, search engine terms, comments, followers, tags and so on. I have accumulated the view data and wanted to share some analysis thereof.

At this point there are 17,000 views and 56 posts (about 1 post per week). The weekly views have grown as follows:

Weekly Views of Visualign Blog

The WordPress dashboard for monthly views looks like this:

Assuming an exponential growth process, this amounts to a doubling roughly every 3 months. This may not sound like much, but if it were to continue, it would lead to a 16x increase per year or a 4096x increase in 3 years. Throughout the first year this model has been fairly accurate and allowed me to predict when certain milestones would be reached (such as 10k views, reached in Apr-2012, or 100k views, predicted by Jan-2013).
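The arithmetic behind these numbers can be checked with a short Python sketch. It assumes that total views double every 3 months and starts from the 17,000 views mentioned above; the function names are mine.

```python
import math

DOUBLING_MONTHS = 3  # assumed doubling period from the text


def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Multiplicative growth after the given number of months."""
    return 2 ** (months / doubling_months)


def months_to_reach(target, current, doubling_months=DOUBLING_MONTHS):
    """Months needed to grow from current to target under steady doubling."""
    return doubling_months * math.log2(target / current)


print(growth_factor(12))   # growth over one year
print(growth_factor(36))   # growth over three years
print(round(months_to_reach(100_000, 17_000), 1))  # months until 100k views
```

Under these assumptions the 100k milestone is a bit less than 8 months away, which roughly matches the prediction quoted above.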

However, the underlying process is not a simple exponential growth process. Instead it is the result of multiple forces, some increasing, some decreasing, such as level of interest of fresh content for target audience, rather short half-life of web content, size of audience, frequency of emails or tweets with links to the content etc. So I expect growth to slow down and consequently the 100k views milestone to be pushed out past Jan-2013.

Views come from some 112 countries, albeit very unevenly distributed.

Views by Country (10244 views since Feb-25, 2012)

The Top 2 countries (United States and United Kingdom) contribute nearly half of the views, and the Top 10 countries (9% of the 112) contribute nearly 75% of all views. The fairly high Gini index of this distribution (~0.83) indicates strong dependency on just a few countries. The only surprise for me in the Top 10 list was South Korea, ranking fifth and slightly ahead of India. Germany is probably a bit over-represented due to my German business partner (RapidBusinessModeling) and related network.

Views by country with Top 10 list

One interesting analysis comes from looking at the distribution of views over weekdays. Not every weekday is the same. Thursdays are the busiest and Saturdays the quietest days. After a little more than one year, averaging over some 56 weeks, the distribution looks like this.

Weekday variation of Blog views averaged over 1st year

Of course, time zone boundaries may cause some distortions here, but it looks like the view activity builds during the week until it hits a peak on Thursday. Then it falls sharply to a low on Saturday, and builds from there again. This fits with intuition: One would expect the weekend days to be low as well as Monday and Friday to be lower than the mid-week days. It’s tempting to correlate that with the amount of work or research getting done by professionals. The underlying assumption is that people discover or revisit my Blog when it fits into their work.

A large fraction (> 65%) of referrals comes from search engines. Within those, it’s mostly Google (>90% summed across many countries) with just a small amount of others like Bing. It’s safe to say that without Google search my Blog would have practically no views. Chances are that your first exposure to this Blog came from a Google search as well. One unexpected insight for me was to see a high ratio of image to text searches, typically 3:1 or 4:1. In some ways it shouldn’t be surprising that a blog on data visualizations gets discovered more often by searching for visual elements than for text. It also jibes with the enormous growth of image related sites such as Instagram or Pinterest. I just would not have expected the ratio to be that high.

The beginning is always slow. But any exponential growth sooner or later leads to rather large numbers. So the real question is how one can keep the exponential growth process going. I’d love to hear your comments. If you want to compare this against your own Blog stats, I have shared the underlying data as a Google doc here. I have no idea how this compares to other blog stats in similar domains. If you know of any other public Blog stats analysis, please comment with a pointer below. Thanks.

Addendum 7/11/2012: Today my Blog reached 20,000 views. I noticed over the last few weeks that the deviation from an exponential growth model was getting quite large. For an exponential trend line R² = 0.9886.

Daily views with 20,000 total view milestone

Modeling the weekly views as growing linearly instead gives the total views a quadratic growth. Curve fitting the total views with a 2nd order polynomial yields a very good fit (R² = 0.9977).

Total views growth curve with quadratic curve fit

Linear growth of weekly views is compatible with approximately linear increase in content (steady frequency of about 1 post / week) and thus increased chance of Google search indexing new content (with Google search the main source of view traffic). Quadratic growth of total views is also nonlinear, but far slower than exponential growth. For example, the 100,000 view milestone is now projected to be reached in 08/2013 instead of in 01/2013, i.e. in 13 months as compared to 7 months.
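Why linear weekly views lead to quadratic total views can be seen in a small Python sketch. The starting level and weekly increase below are made-up parameters, not values fitted to my Blog data.

```python
from itertools import accumulate

# Hypothetical model: views in week w are START + SLOPE * w.
START, SLOPE = 50, 10
NUM_WEEKS = 60

weekly = [START + SLOPE * w for w in range(1, NUM_WEEKS + 1)]
total = list(accumulate(weekly))  # running total of views


def closed_form_total(n):
    """Total views after n weeks: START*n + SLOPE*n*(n+1)/2, a quadratic in n."""
    return START * n + SLOPE * n * (n + 1) // 2


# The second differences of the running total are constant (equal to SLOPE),
# which is the signature of a quadratic curve.
second_differences = {
    total[i + 2] - 2 * total[i + 1] + total[i] for i in range(NUM_WEEKS - 2)
}

print(total[-1], closed_form_total(NUM_WEEKS))
print(second_differences)
```

A doubling of the time span therefore roughly quadruples the total, instead of squaring the growth factor as in the exponential model.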

Addendum 11/1/2012: The Blog reached 30000 views on Oct-19 and here is a chart of the monthly views through Oct-2012:

Monthly Blog views through Oct-2012

August and September have been slow, presumably seasonal variation. I also didn’t post between late August and mid October. The view data of the last couple of months no longer support the theory of significant growth in view frequency. Instead, multiple dynamic factors come into play. At times views spike due to a mention or a post of temporary interest – such as the recent post on visualizing superstorm Sandy. But such spikes quickly fade away according to the very limited half-life of web information these days. The undulating 4 week trailing average in weekly views below visualizes this clearly. The net effect has been a plateau in view frequency around 3000 per month.

Weekly Views with average Nov 2012
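A trailing average like the one in this chart takes only a few lines of Python. The weekly numbers below are invented to show how a one-week spike gets smoothed out; the handling of the first weeks (averaging over whatever is available) is my own choice.

```python
def trailing_average(values, window=4):
    """Average of each value and up to window-1 values before it."""
    result = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


# Hypothetical weekly views with a single spike in week 5.
weekly_views = [700, 720, 680, 710, 1500, 730, 690, 705]

for week, (raw, avg) in enumerate(zip(weekly_views, trailing_average(weekly_views)), start=1):
    print(f"week {week}: {raw:5d} views, 4-week trailing average {avg:7.1f}")
```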

I continue to see most of the referrals coming from Google searches, still with a majority of those being image searches. Engagement growth has been anemic, with relatively few comments, back links or other forms of engagement. It seems to me that growth proceeds in phases, with growth spurts interspersed by plateaus of varying length. One such growth spurt has been reported by Andrei Pandre on his Data Visualization Blog through the use of Google+. Perhaps it’s time to extend this Blog to Google+ as well.

Variation of views by weekday

With regard to variation of views by weekday, the qualitative pattern remains. Tuesday is now emerging as the day with the most views, with Monday, Wednesday, and Thursday slightly behind, but still above average. Friday is slightly below average, Saturday is the lowest day with only half the views and Sunday in between.

I’m not sure whether to conclude from that that important posts should be published on a particular weekday. Again, most views come from Google searches and are accumulated over time, so perhaps only the height of the initial spike will vary somewhat based on the publishing weekday.

 

Posted by on June 12, 2012 in Scientific

 

Inequality Comparison

In previous posts on this Blog we have looked at various inequalities as measured by their respective Gini Index values. Examples are the posts on Under-estimating Wealth Inequality, Inequality on Twitter, Inequality of Mobile Phone Revenue, and how to visualize as well as measure inequality.

Here is a bubble chart comparison of 14 different inequalities:

Comparison of various Inequalities

 

Legend:

  • P1: Committee donations to 2012 presidential candidates (2011, Federal Election Commission)
  • P2: US political donations to members of congress and senate (2010, US Center for Responsive Politics)
  • A1: Twitter Followers (of my tlausser account) (2011, Visualign)
  • A2: Twitter Tweets (of my tlausser account) (2011, Visualign)
  • I1: Global Share of Tablet shipment by Operating System (2011, Asymco.com)
  • I2: Mobile Phone Shipments (revenue) (2009, Asymco.com)
  • I3: US Car Sales (revenue) (2011, WSJ.com)
  • I4: Market Cap of Top-20 Nasdaq companies (2011, Nasdaq)
  •  

    The x-axis shows the size of the population in logarithmic scale. The y-axis is the Gini value. The “80-20 rule” corresponds to a Gini value of 0.75. Bubble size is proportional to the log(size), i.e. redundant with the x-axis.
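How an "80-20 rule" maps to a Gini value depends on the assumed shape of the distribution. The following Python sketch compares two readings: a two-group split (everyone within a group is equal) and a Pareto distribution in which the top 20% hold 80%. Treating the 0.75 above as the Pareto reading is my assumption here; both function names are only illustrative.

```python
import math


def two_group_gini(bottom_fraction, bottom_share):
    """Gini of a two-segment Lorenz curve through (bottom_fraction, bottom_share)."""
    # Area under the piecewise-linear Lorenz curve: a triangle plus a trapezoid.
    area = (0.5 * bottom_fraction * bottom_share
            + 0.5 * (1 - bottom_fraction) * (bottom_share + 1))
    return 1 - 2 * area


def pareto_gini(top_fraction, top_share):
    """Gini of the Pareto distribution whose top_fraction holds top_share."""
    # For a Pareto distribution the top fraction p holds a share p ** (1 - 1/alpha).
    alpha = 1 / (1 - math.log(top_share) / math.log(top_fraction))
    return 1 / (2 * alpha - 1)


print(round(two_group_gini(0.8, 0.2), 2))  # two equal-within-group blocks
print(round(pareto_gini(0.2, 0.8), 2))     # smooth Pareto distribution
```

The two-group reading gives about 0.6, which is the lowest Gini compatible with an 80-20 split, while the Pareto reading gives about 0.76, close to the 0.75 used in the chart.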

    Discussion:

    Most of the industrial inequalities studied have a small population (10-20); this is usually due to the small number of competitors studied or a focus on the Top-10 or Top-20 (for example in market capitalization). With small populations the Gini value can vary more as one outlier will have a disproportionately larger effect. For example, the Congressional Net Worth analysis (top-left bubble) was taken from a set of 25 congressional members representing Florida (Jan-22, 2012 article in the Palm Beach Post on net worth of congress). Of those 25, one (Vern Buchanan, owner of car dealerships and other investments) has a net worth of $136.2 million, with the next highest at $6.4 million. Excluding this one outlier would reduce the average net worth from $6.9 to $1.55 million and the Gini index from 0.91 (as shown in the Bubble Chart) to 0.66. Hence, Gini values of small sets should be taken with a grain of salt.
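The sensitivity of the Gini value to a single outlier in a small set can be illustrated with a Python sketch. Only the two largest net worths ($136.2 and $6.4 million) are taken from the text; the other 23 values are invented, so the resulting Gini values will not match the 0.91 and 0.66 quoted above.

```python
def gini(values):
    """Gini index of non-negative values (0 = perfect equality)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical net worths in $ million for 23 members (invented values).
typical = [round(0.1 * k, 1) for k in range(1, 24)]

with_outlier = typical + [6.4, 136.2]   # 25 members including the outlier
without_outlier = typical + [6.4]       # same set with the outlier removed

gini_with = gini(with_outlier)
gini_without = gini(without_outlier)

print(f"Gini with outlier:    {gini_with:.2f}")
print(f"Gini without outlier: {gini_without:.2f}")
```

Even in this made-up example, removing one member changes the Gini value drastically, which is the reason for the grain of salt.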

    The studied cases in attention inequality have very high Gini values, especially for the traffic to websites (top-right bubble), which given the very large numbers (Gini = 0.985, Size = 1 billion) is the most extreme type of inequality I have found. Attention in social media (like Twitter) is extremely unevenly distributed, with most of it going to very few alternatives and the vast number of alternatives getting practically no attention at all.

    Political donations are also very unevenly distributed, considerably above the 80-20 rule. The problem from a political perspective is that donations buy influence and such influence is very unevenly distributed, which does not seem to be following the democratic ideals of the one-person, one-vote principle of equal representation.

    Lastly, economic inequalities (wealth, income, capital gains, etc.) are perhaps the most discussed forms of inequality in the US. Inequalities at the level of all US households or citizens measure large populations (100 – 300 million). One obvious observation from this Bubble Chart is that capital gains inequality is far, far higher than income inequality.

    Tool comment: I have used Excel 2007 to collect the data and create this chart. Even though it is natively supported in Excel, the Bubble Chart has a few restrictions which make it cumbersome. For example, I haven’t found a way to use Data Point labels from the spread-sheet; hence a lot of manual editing is required. I also don’t know of a way to create animated Bubble-Charts (to follow the evolution of the bubbles over time) similar to those at GapMinder. Maybe I need to study the ExcelCharts Blog a bit more… If you know of additional tips or tweaks for BubbleCharts in Excel please post a comment or drop me a note. Same if you are interested in the Excel spread-sheet.

     

    Posted by on February 3, 2012 in Industrial, Socioeconomic

     


    Inequality on Twitter

    A lot has been written about economic inequality as measured by distribution of income, wealth, capital gains, etc. In previous posts such as Inequality, Lorenz-Curves and Gini-Index or Visualizing Inequality we looked at various market inequalities (market share and capitalization, donations, etc.) and their respective Gini coefficients.

    With the recent rise of social media we have other forms of economy, in particular the economy of time and attention. And we have at least some measures of this economy in the form of people’s activities, subscriptions, etc. Whether it’s Connections on LinkedIn, Friends on Facebook, or Followers on Twitter, all of the social media platforms have some social currencies for attention. (Influence is different from attention, and measuring influence is more difficult and controversial – see for example the discussions about Klout-scores.)

    Another interesting aspect of online communities is that of participation inequality. Jakob Nielsen did some research on this and coined the well-known 90-9-1 rule:

    “In most online communities, 90% of users are lurkers who never contribute, 9% of users contribute a little, and 1% of users account for almost all the action.”

    The above linked article has two nice graphics illustrating this point:

    Illustration of participation inequality in online communities (Source: Jakob Nielsen)

    As a user of Twitter for about 3 years now I decided to do some simple analysis, wondering about the degrees of inequality I would find there. Imagine you want to spread the word about some new event and send out a tweet. How many people you reach depends on how many followers you have, how many of those retweet your message, how many followers they have, how many other messages they send out and so on. Let’s look at my first twitter account (“tlausser”); here are some basic numbers of my followers and their respective followers:

    Followers of tlausser Followers on Twitter

    Some of my followers have no followers themselves, one has nearly 100,000. On average, they have about 3600 followers; however, the total of about 385,000 followers is extremely unequally distributed. Here are three charts visualizing this astonishing degree of inequality:

    Of 107 followers, the top 5 have ~75% of all followers that can be reached in two steps. The corresponding Gini index of 0.90 is an example of extreme inequality. From an advertising perspective, you would want to focus mostly on getting these 5% to react to your message (i.e. retweet). In a chart with linear scale the bottom half barely registers.

    Most of my followers have between 100-1000 followers themselves, as can be seen from this log-scale Histogram.

    What kind of distribution does the number of followers have? It seems that Log[x] is roughly normally distributed.

    As for participation inequality, let’s look at the number of tweets that those (107) followers send out.

    Some of them have not tweeted anything, while the chattiest has sent more than 16,000 tweets. On average, each follower has 1280 tweets; the total of 137,000 tweets is again highly unequally distributed, with a Gini index of 0.77.

    The top 10 make up about 2/3 of the entire conversation.

    Again the bottom half hardly contributes to the number of tweets; however, the ramp in the top half is longer and not quite as steep as with the number of followers. Here is the log-scale Histogram:

    I did the same type of analysis for several other Twitter users in the central range (between 100 and 1000 followers). The results are similar, but certainly not yet robust against statistical sampling errors. (A larger scale analysis would require a higher Twitter API limit than my free 350 per hour.)

    These preliminary results indicate that there are high degrees of inequality regarding the number of tweets people send out and even more so regarding the number of followers they accumulate. How many tweets Twitter users send out over time is more evenly distributed. How many followers they get is less evenly distributed and thus leads to extremely high degrees of inequality. I presume this is caused in part by preferential attachment as described in Barabasi’s book “Linked: The new science of networks“. As with all forms of attention, who people follow depends a lot on who others are following. There is a very long tail of small numbers of followers for the vast majority of Twitter users.

    That said, the degree of participation inequality I found was lower than the 90-9-1 rule, which corresponds to an extreme Gini index of about 0.96. Perhaps that’s a sign of the Twitter community having evolved over time? Or perhaps just a sign of my analysis sample being too small and not representative of the larger Twitterverse.

    In some ways these new media are refreshing as they allow almost anyone to publish their thoughts. However, it’s also true that almost all of those users remain in relative obscurity and only a very small minority gets the lion’s share of all attention. If you think economic inequality is too high, keep in mind that attention inequality is far higher. Both are impacting the policy debate in interesting ways.

    Turning social media attention into income is another story altogether. In his recent Blog post “Turning social media attention into income“, author Srinivas Rao muses:

    “The low barrier to entry created by social media has flooded the market with aspiring entrepreneurs, freelancers, and people trying to make it on their own. Standing out in it is only half the battle. You have to figure out how to turn social media attention into social media income. Have you successfully evolved from blogger to entrepreneur? What steps should I take next?”

     

    Posted by on December 6, 2011 in Industrial, Scientific, Socioeconomic

     


    Share and Inequality of Mobile Phone Revenues and Volumes

    The analyst website Asymco.com visualizes various financial indicators of mobile phone companies in this interactive vendor bubble chart (follow link, select “Vendor Charts”). It covers the following 8 companies: Apple, HTC, LG, Motorola, Nokia, RIM, Samsung, Sony Ericsson. From the “vendor data” tab I downloaded the data and looked at the revenue and volume distributions for the last 4 years.

    Revenue Share of Mobile Phones and corresponding Gini Index

    Note the sharp reduction in inequality of revenue distribution in the 9/1/08 quarter, when Apple achieved nearly 10x in revenue (and volume) compared to the year before. While the iPhone 1 was introduced a year earlier in 2007, in commercial terms the iPhone 3G started to have strong market impact when introduced in the second half of 2008.

    Volume Share of Mobile Phones and Gini Index

    Volume inequality is considerably higher (average Gini = 0.61) than Revenue inequality (0.43) due to two dominant shippers (Nokia and Samsung), which continue to lead the peer group in volume. Only recently has the inequality been reduced, i.e. the volumes are distributed more evenly. Apple’s growth in volume share has come at the expense of other players (mainly Motorola and Sony Ericsson).

    Volume share is a lagging indicator regarding a company’s innovation and success. It can be dominated for a long time by players who are past their prime and in financial distress (like Nokia). Revenue is more useful to predict a company’s future growth and success. But the real story is told when comparing Profit. Apple’s (Smart Phone) Profit dwarfs that of the other 7 competitors:

    Profit Comparison between 8 Mobile Phone Vendors (Source: Asymco.com)

    Click on the image to go to Asymco’s interactive chart (requires Flash). The bubble chart display over time is very revealing regarding Apple’s meteoric rise.

     

    Posted by on October 22, 2011 in Financial, Industrial

     


     