
IRIS National Fair

2019

Introduction: Fairer Ranking Systems using Linear Regression and Cumulative Binomial Distribution

Every four years, we see tables like these, depicting the number of medals won by each country at the Olympics and ranking them accordingly. However, this system is arbitrary on two counts. First, it ranks countries by the number of gold medals won, without taking silver and bronze medals into consideration. Second, countries with larger populations have a higher chance of appearing near the top, so smaller countries get little to no recognition.

One proposed alternative is the medals-per-capita (MPC) ranking system, in which a country's score is its medal count divided by its population. However, this system is arbitrary too. For example, at the 2012 Olympics, Grenada topped the MPC rankings even though it won just one medal. For the United States to pull ahead of Grenada, it would have to win more than 2,800 medals, more than thrice the number awarded at the entire Games.
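
The rough arithmetic behind this claim can be sketched as follows; the 2012 populations used here are approximations for illustration, not figures from the project's data:

```python
# Approximate 2012 populations (assumptions for illustration only).
grenada_population = 110_000
us_population = 315_000_000

# Grenada's MPC score with its single medal:
grenada_mpc = 1 / grenada_population

# Medals the US would need to match Grenada's MPC score:
medals_needed = grenada_mpc * us_population
print(round(medals_needed))  # roughly 2,900 medals
```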

This unfairness prompted us to undertake this project. Our primary objective was to develop a fairer ranking system by incorporating concepts from statistics and probability, specifically linear regression and the cumulative binomial distribution function. We hypothesized that such a system would prove better than the ones currently in use because it is grounded in mathematics rather than arbitrary convention.

Method

In our pursuit of a better ranking system, we examined numerous mathematical concepts. After some research, the candidate formulae and methods were compiled for testing. This list spanned various parts of statistics, ranging from probabilistic formulae like the probability mass function to Bayesian statistical modeling. Following rigorous testing, two methods were finalized: the cumulative binomial distribution and linear regression.

Testing of the compiled methods was conducted on a small data set collected from the recently concluded Asian Games. This data set included both large and small countries, in order to see how well each method performed in giving equal recognition. The best results were given by the cumulative binomial distribution function and linear regression: the former yielded absolute scores, while the latter scored countries relative to the overall trend.

Once these two methods had been finalized, official data from the last three Summer Olympics was collected, covering every country that participated and won medals. Information about population, gross domestic product, nutrition, and healthcare services was also collected to make the system fairer.

Microsoft Excel was primarily used to perform the calculations and to organize and visualize the data, since its range of built-in functions made the otherwise laborious calculations much easier. For each Olympics, two spreadsheets were created, each consisting of four sub-sheets for the four factors. The results were then fed into a formula that output a score for each country. The results from the cumulative binomial distribution function and from linear regression were compared in order to verify the accuracy of our system.

The data was divided into four sheets, one for each factor: population, gross domestic product (GDP), healthcare services, and nutrition. Each sheet was divided into three tables: one for the probability of winning as many golds as the country did, one for the probability of winning as many golds and silvers, and one for the probability of winning as many medals in total. This gave gold medals the highest weightage, followed by silver and then bronze.

After the data had been organized, Excel's BINOMDIST function was used to calculate the cumulative probabilities. In Microsoft Excel, this function is defined as =BINOMDIST(number_s, trials, probability_s, cumulative). In our case, number_s was the number of successes, i.e. the number of medals won by the country; trials was the number of medals awarded; and probability_s was the probability of success, i.e. the country's percentage of world population/GDP. TRUE was entered for cumulative so as to obtain a cumulative binomial probability, since entering FALSE would return a value from the probability mass function instead.

The values for a country from all three tables were averaged to obtain a score per sheet, and the scores from the four sheets were averaged to obtain the country's final score, which determined its final rank.
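
The same calculation can be sketched outside Excel; here is a minimal Python equivalent of the BINOMDIST step, using SciPy's binomial distribution (the figures are illustrative, not the project's actual data):

```python
from scipy.stats import binom

# Illustrative inputs, mirroring =BINOMDIST(number_s, trials, probability_s, TRUE).
medals_won = 10          # number_s: medals won by the country
medals_awarded = 900     # trials: total medals given at the Games
population_share = 0.01  # probability_s: country's share of world population

# Cumulative probability of winning at most `medals_won` medals by chance.
score = binom.cdf(medals_won, medals_awarded, population_share)
print(score)
```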

Like the first method, this one divided the data into four sheets, one for each of the considered factors. However, instead of three tables, only one table was made, containing each country's weighted total medal count. This data was then graphed, with the percentage of global population/GDP on the x-axis and the weighted number of medals on the y-axis. Each country's data appeared as a point on the graph, a regression line was plotted, and its equation was calculated in the form y = ax + c. For convenience, this was rewritten in the general form ax − y + c = 0, and the signed deviation of a country's point (x₁, y₁) from the line was calculated as d = (ax₁ − y₁ + c) / √(a² + 1).
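
A minimal sketch of this deviation score in Python, assuming x holds population/GDP shares and y weighted medal counts (the numbers are illustrative only):

```python
import numpy as np

# Illustrative data: share of world population/GDP (x) vs. weighted medals (y).
x = np.array([0.002, 0.010, 0.040, 0.180, 0.430])
y = np.array([3, 12, 28, 90, 110])

a, c = np.polyfit(x, y, 1)                  # regression line y = a*x + c

# Signed perpendicular deviation from the line a*x - y + c = 0; the sign
# indicates which side of the line a country's point falls on.
deviation = (a * x - y + c) / np.sqrt(a**2 + 1)
```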

The countries on the right side of the regression line had performed better than average and the countries on the left side had performed worse. The countries with a higher deviation were ranked higher. Again, the deviation scores from all of the sheets were averaged for a final score.

Discussion

Once the calculations for both proposed methods were complete, the results were compiled in the chart below. It can clearly be seen that, compared to the current method, these methods give recognition to both small and large countries.

To compare the method currently in use with the two methods we propose, graphs were created. The x-axis showed each country's factor score, given by a formula combining the country's population, GDP, nutrition, and healthcare data. The y-axis showed the rank score, reflecting the country's rank in that particular ranking system; the higher the country's rank, the higher the rank score. This gave an efficient method of comparison, in which the slope of the trendline alone provides a measure of the fairness of the system. In Fig. 2, the trendline slopes clearly upwards, indicating that, in general, a higher factor score means more medals, which is unfair to countries with lower factor scores. In the following two figures, particularly Fig. 3, the trendline is much flatter, indicating that countries with both higher and lower factor scores receive almost equal recognition. Thus, our hypothesis was supported. The slightly larger slope in Fig. 4 also suggests that linear regression is better suited to smaller samples.

If this project is repeated in the future, a computer program could be written to make the calculations easier and less time-consuming; such a program could be deployed at the Olympics so that results are available immediately. Many more factors could also be taken into consideration, such as interest in sports, budget allocated to sports, weather, and facilities. Naturally, not every method could be tested in the course of this project, so the set of formulae tested could be expanded in the future. More accurate data from the exact start of the Olympics could be used, and surveys could be conducted to gauge national interest in sports.

Floating-point errors were the likely source of most of the errors in this project. These could be reduced by using software with higher-precision arithmetic, or by repeating the calculations to obtain a better estimate of each country's score. Moreover, data for each country at the exact start of the Olympics was not available; the data used was off by at least three months.

Conclusion

The unfairness of the ranking system currently used at the Olympics and other sporting events is a major problem. We believe the alternative ranking systems proposed in this project will solve it. These methods are much fairer because they rest on mathematical concepts rather than arbitrary convention, as has been highlighted throughout the project. They can be improved upon and publicized so that the general public understands them better. If these methods are implemented, we will surely see a much fairer Olympics, one which levels the playing field for all nations and marks true competition.

2020

Synopsis: Prediction of Global Climate Change using Multivariate Regression and Subsequent Solution Mapping with Superpixel Segmentation and K Means Clustering

In our project, we predicted global climate change emissions for the year 2020. To achieve this, we used NASA image datasets of various factors affecting global warming and climate change: CO2, water vapour, cloud fraction, aerosol depth, and ozone. Upon chronological analysis and subsequent combination, each of these factors was predicted for the year 2020. Finally, all the factors were combined to create an original world climate change emissions map showing each country's and region's contribution to climate change.

This map is used to draw important and interesting conclusions. It is also combined with other, apparently unrelated maps to provide an optimal mapping of various solutions proposed by the UN to combat climate change. All of this is done in Python using computer vision and data analysis libraries. First, a regression model predicts each pixel value for the specified period. These pixels are then combined to produce a final climate change emissions map, converted to a Bonne map for ease of analysis and coloured with a simple scale of gradients, making it efficient and easy to understand. Peaks and dips in values can be detected, and regions needing immediate attention can be identified. The map is analysed chronologically to pick out helpful policies, by relating changes in values over a region to those of neighbouring regions.

Furthermore, a sequential convolutional neural network model, built with the Keras library, analyses images in sets of {20,20,20} to cross-check the predicted map.

Introduction and Objectives

The main aim of this project was to predict global climate change emissions using NASA image datasets on atmospheric factors affecting global warming: CO2, ozone, water vapour, cloud fraction, and aerosol depth.{2}{3}

The level of each factor for the year 2020 was predicted through chronological analysis of the dataset images from 2004 to 2019, taken at quarter-year intervals. All the factors were then combined in proportion to produce a final climate change emissions map for 2020, coloured using gradients in which each pixel's value reflects that area's contribution to climate change.{1}

A CNN model was also trained on the same images at a higher frequency to cross-validate the algorithm's predictions, and the programs were tweaked for better accuracy.

This map has many possible uses. It could be combined with other maps, such as the ‘Global Land Temperature Map’{2} and the ‘Global Land Vegetation Map’{2}, to provide an optimized mapping of candidate regions for solutions like ‘Heat Capture’ and ‘Direct Air Capture’ respectively.{5}{6} Tweaking the proportions and giving more weightage to certain factors, such as water vapour, can help map specific solutions like the ‘Atmospheric Water Generator’.{7}

Moreover, we could analyse the areas where dramatic changes in emissions, negative or positive, have occurred. The specific policies introduced during those periods could be compiled and helpful conclusions drawn.

Innovation

Chronological analysis of 5 different layers of atmospheric satellite image datasets from modern Bispectral Superlattice Type-II Infrared cameras.{3}

Completely original and efficient Python model using linear regression, computer vision, matplotlib, and sequential convolutional neural networks.

Combination of factors from the MOPITT (Measurements Of Pollution In The Troposphere) sensor on NASA’s Terra satellite to predict overall global climate change emissions, and not focusing on individual factors.{2}{3}{9}

Use of the Ekholm–Modén (EM) formula, generally used to calculate DMTs (Daily Mean Temperatures), to focus on optimum dates and make the complete model more efficient and up-to-date.{4}

Moreover, official records and methods were used in the calculations of ratios using GAW (Global Atmospheric Watch).{1}

Combining gradients and matching different scale units, considering each factor's individual effect on the environment. This is done by using appropriate proportions and fine-tuning them.{1}

Making a sequential convolutional neural network model that uses multivariate regression to predict near-future levels of global warming emissions, cross-validating the NumPy algorithm for more accurate predictions.

Using the predicted image to cross-check with government policies and draw conclusions.

Combining the image prediction with other, apparently unrelated maps to draw interesting conclusions and map different types of solutions, which could help the UN and other agencies target various areas.{2}{5}{6}{7}

Displaying a network of connected regions, showing the sequence of targeting, on top of a simple pseudo-conic map (a Bonne projection, which preserves area over shape).{12}

Using K Means Clustering, Superpixel Segmentation, and Canny Edge detection to provide this network of regions.

Algorithm

The first step was collecting the 50 images for each factor, based on dates spanning 15 years at 3-month intervals, with representative values calculated using the EM formula: Tmean = a·T07 + b·T13 + c·T19 + d·Tmax + e·Tmin.{4}
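
An illustrative evaluation of the EM formula in Python; the coefficients a through e are location-dependent in practice, so the values below are placeholders, not the project's settings:

```python
# Placeholder coefficients; in practice a-e depend on location and month.
a, b, c, d, e = 0.25, 0.25, 0.25, 0.15, 0.10

# Temperatures at 07:00, 13:00, and 19:00, plus the daily max and min (in C).
t07, t13, t19, t_max, t_min = 12.0, 18.0, 15.0, 19.5, 10.5

t_mean = a * t07 + b * t13 + c * t19 + d * t_max + e * t_min
print(t_mean)
```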

These images were then imported and converted to NumPy arrays. Each pixel was plotted on a graph with the date on the x-axis and the RGB value on the y-axis. Using the ‘polyfit(x,y,z)’ and ‘poly1d(z)’ methods, a regression line was fitted, predicting the value of each pixel for the next quarter (February 2020). The pixels were then recombined into a predicted image of that quarter for that factor.
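
A minimal sketch of this per-pixel regression step, assuming the 50 quarterly images for one factor are already loaded as a stack of same-shaped grayscale arrays (the stack here is a random placeholder, and variable names are hypothetical):

```python
import numpy as np

quarters = np.arange(50)                 # x-axis: quarter index (date)
stack = np.random.rand(50, 100, 100)     # placeholder for the 50-image stack

height, width = stack.shape[1:]
prediction = np.empty((height, width))

for i in range(height):
    for j in range(width):
        pixel_series = stack[:, i, j]                    # pixel value over time
        coeffs = np.polyfit(quarters, pixel_series, 1)   # fit regression line
        line = np.poly1d(coeffs)
        prediction[i, j] = line(50)      # extrapolate to the next quarter

# `prediction` is then recombined into the predicted image for that factor.
```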

These 5 images were converted to a universal unit of pixel value and recoloured using ‘colorsys’ and ‘skimage’. The images were then combined in the ratio of the GAW values for each element (a 12:45:120:127:33 ratio of effect on temperature).{1} This final image was predicted for stacks of 20 quarters and rerun through a sequential neural network built with ‘hstack’, ‘Model’, ‘Flatten’, and ‘MaxPooling1D’. A resultant image for the year 2020 was the final output.
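
A minimal sketch of the weighted combination, assuming the five predicted factor maps have already been converted to a common pixel-value scale (the arrays here are random placeholders; the 12:45:120:127:33 ratio is the GAW ratio cited above):

```python
import numpy as np

# Placeholders for the five predicted factor maps, one per atmospheric factor,
# already rescaled to a common pixel-value range.
factor_maps = [np.random.rand(100, 100) for _ in range(5)]

# GAW ratio of each factor's effect on temperature, normalised to sum to 1.
weights = np.array([12, 45, 120, 127, 33], dtype=float)
weights /= weights.sum()

combined = sum(w * m for w, m in zip(weights, factor_maps))
```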

The current Land-Vegetation dataset was combined with our final output to propose optimal regions for Direct Air Capture.{5} Land-Temperature datasets were combined for Heat Capture.{6} The Water-Vapour dataset, in conjunction with the Land-Vegetation dataset, was recombined to propose a Water-Capture mapping.{2}{7}

This was done using K Means Clustering. The maps to be combined were segmented, the worst-affected regions were separated out, and the regions we wanted to focus on were isolated. These were then combined and overlapped to produce the mapping. A superpixel segmentation algorithm using ‘slic’ was applied, along with a Canny edge detection model, to preserve continent boundaries.
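
A rough sketch of how slic, K-means, and Canny edge detection might fit together for this step; the input map is a random placeholder, and parameters such as n_segments and the number of clusters are illustrative, not the project's settings:

```python
import numpy as np
from skimage.segmentation import slic
from skimage.feature import canny
from skimage.color import rgb2gray
from sklearn.cluster import KMeans

# Placeholder for the combined RGB emissions map as a float array.
emissions_map = np.random.rand(200, 300, 3)

# Superpixel segmentation groups neighbouring pixels into coherent regions.
segments = slic(emissions_map, n_segments=100, compactness=10)

# K-means clusters the mean colour of each superpixel, separating the
# worst-affected regions from the rest (3 clusters is an arbitrary choice).
means = np.array([emissions_map[segments == s].mean(axis=0)
                  for s in np.unique(segments)])
labels = KMeans(n_clusters=3, n_init=10).fit_predict(means)

# Canny edge detection on the greyscale map helps preserve continent
# boundaries when the clustered regions are overlaid.
edges = canny(rgb2gray(emissions_map))
```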

Peaks and dips in the graphs of the 20-quarter predictions run through the sequential CNN model were matched with recent government policies in those regions. This was done manually with the visual aids of ‘matplotlib’.
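
The exact architecture of the cross-checking model is not specified in the text, so the following Keras sketch is only indicative of a small sequential model built from the layers named above; layer sizes and the placeholder data are assumptions:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

# Each training sample is a pixel's history over 20 quarters; the target is
# the next-quarter value (illustrative shapes and random placeholder data).
model = Sequential([
    Conv1D(16, kernel_size=3, activation="relu", input_shape=(20, 1)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(1),                       # predicted pixel value
])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(100, 20, 1)      # 100 pixel histories of 20 quarters
y = np.random.rand(100)             # next-quarter values
model.fit(x, y, epochs=2, verbose=0)
```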

Method

First, NASA image datasets were collected for the following atmospheric factors:

Cloud Fraction - In addition to providing rain and snow, clouds have warming and cooling effects. Analysed on the okta scale of 0.0-1.0.{2}

Aerosol - Aerosols affect infrared light and cloud formation. Analysed on the optical depth scale of 0.0-1.0.{2}

CO2 - A key greenhouse gas that drives global warming by trapping heat. Analysed on the ppbv scale of 0-300.{3}

Water Vapour - The largest contributor to global warming (60%). Analysed on the ‘cm’ scale of 0.0-6.0.{2}{3}

Ozone - A molecular layer that absorbs UV radiation and helps maintain temperature. Analysed on the Dobson scale of 100-500.{3}

The images for each factor were converted to workable NumPy arrays, and each of the 12,250,000 pixels was analysed chronologically over the 15-year span from 2004 to 2019. Using linear regression and computer vision libraries, pixel values were predicted for 2020. These pixel values were fed into another program, which produced a predicted image for every factor.

The predicted image for each factor was then combined using GAW ratios in proportions with variable weightage depending on specific solutions.{1} For this, the different scales were unified and an equivalent colour gradient was created.

This final image, recreated using a suitable universal scale of colour gradients, was the output of our model.

Results and Conclusion

The final image predicted for the year 2020 showed a rise in carbon content of up to 10 ppbv. Following previously predicted trends, the average land temperature for 2020 would be 0.005 °C higher than 2019's 0.88 °C.{8}

The climate change emissions map showed a significant increase in emissions over Northern and Central Africa. A similar peak was seen over northern South America, possibly as a result of the recent Amazon fires. A slight increase over Central Asia and the Middle East was also noted.

A decline in emissions was seen over Central and Western Europe, particularly Denmark, Norway, and Finland. A slight decline was also seen in emissions over North America.{8}

Looking at significant peaks over the last 15 years, a slight decline was observed in the pixels representing parts of India and China in 2014. This could be a direct result of the EAP (2014-2020). There was also a decline over Switzerland, possibly a result of the successful Direct Air Capture plants of 2015.{8}{10}{11}

The conclusive part of the model also gave interesting results for mapping various solutions. Combining vegetation maps with our climate change map showed that the best place to implement Direct Air Capture, a method in which carbon is captured directly from the air and embedded into the soil, would be Northern and Central Africa, mainly over the Sahara. Other important regions included the arid regions of Russia, if an economic factor was included.{5}

References

[1] Global Atmospheric Watch: https://www.elementascience.org/articles/10.12952/journal.elementa.000067/
[2] Dataset 1: https://earthobservatory.nasa.gov/global-maps
[3] Dataset 2: https://neo.sci.gsfc.nasa.gov
[4] EM Formula: https://rmets.onlinelibrary.wiley.com/doi/full/10.1002/joc.3510
[5] Direct Air Capture: https://en.wikipedia.org/wiki/Direct_air_capture
[6] Heat Capture: https://en.wikipedia.org/wiki/Waste_heat_recovery_unit
[7] Atmospheric Water Generator: https://en.wikipedia.org/wiki/Atmospheric_water_generator
[8] Policy responses: https://www.world-nuclear.org/information-library/energy-and-the-environment/policy-responses-to-climate-change.aspx
[9] MOPITT: https://en.wikipedia.org/wiki/MOPITT
[10] Swiss Direct Air Capture 2015: https://www.sciencemag.org/news/2017/06/switzerland-giant-new-machine-sucking-carbon-directly-air
[11] EAP 2014-2020: https://ec.europa.eu/environment/action-programme/
[12] Pseudo-Conic Map: https://en.wikipedia.org/wiki/Map_projection

* Curly braces {} containing numbers after certain sentences in the above sections are references to these numbered [] links for further reading and context.
