To remove bias from AI, start by building fairer data sets

faddiddn 2021-09-15 10:30:44

Google recently fired Timnit Gebru, an AI ethics researcher, and the problem of algorithmic bias has attracted attention again.

Timnit Gebru is a leader in analyzing the risks and inequalities of AI models. She was fired by Google over an unpublished paper. The paper asks: Are language models too big? Who benefits from them? Do they increase prejudice and inequality?

[Image: Timnit Gebru, who was recently fired by Google]

Timnit Gebru's doubts are not groundless. In 2016, Microsoft's AI chatbot Tay went online. Soon after Tay began chatting with internet users, however, it was "taught bad habits" and turned into an all-round "bad girl" that was anti-Semitic, sexist, and racist.

In July 2020, MIT was forced to take down the 80 Million Tiny Images data set. The data set had been widely used to train machine learning models for tasks such as image recognition, but it contained images with offensive labels, including racist and misogynistic ones.

MIT then released a statement on its website saying it had been unaware of these offensive labels. Because the 80 million images are only 32×32 pixels each, they are difficult to clean manually, and this is what led to the discriminatory results.

[Image: MIT's 80 Million Tiny Images data set was permanently taken offline because of discriminatory labels]

The same problem occurred with Duke University's PULSE algorithm. The algorithm is designed to sharpen blurred face images, but when it was tested on a blurred photo of former US President Barack Obama, it produced a white face.

AI expert Yann LeCun attributed this to bias in the data set. In other words, most of the training data used for the algorithm consisted of white faces, so its results skew toward white faces.

[Image: A blurred photo of Obama is reconstructed as a white face]

For a long time, there has been a misconception about computer technology: algorithmic decision-making is fairer, because mathematics is about equations, not skin color.

Yuval Noah Harari, the author of "Sapiens: A Brief History of Humankind", calls this misconception "data religion": the belief that data will become the basis of all future decision-making and that algorithms can eliminate human bias from the decision-making process.

But algorithmic discrimination is not "a small problem". When it touches major matters such as credit evaluation, crime risk assessment, and employment evaluation, the decisions made by AI influence or even determine loan amounts, sentencing, and hiring. At that point, discrimination is no longer insignificant.

More and more AI companies and research institutions are looking for effective ways to address algorithmic bias.

Synthesized previously launched a set of open-source tools that can quickly identify and mitigate algorithmic bias. The company says users only need to upload a structured data file to start analyzing potential bias in data attributes such as gender, age, race, religion, and sexual orientation.

A research team at Princeton University's School of Engineering has also developed a tool for flagging potential biases in image sets used to train AI. The tool, called REVISE, uses statistical methods to check a data set for underrepresentation with respect to the target population, gender, and geographical location.
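As a rough illustration of what a statistical check for underrepresentation can look like, the sketch below compares each group's share in a labeled data set with a reference share and flags groups that fall well short. This is not how REVISE or the Synthesized tools are implemented; the function name `representation_report`, the `tolerance` threshold, the group names, and the reference shares are all assumptions made for the example.

```python
from collections import Counter


def representation_report(labels, reference_shares, tolerance=0.5):
    """Compare each group's observed share in `labels` with a reference share.

    A group is flagged when its observed share is below `tolerance` times
    its reference share. The threshold is an illustrative assumption.
    """
    counts = Counter(labels)
    total = len(labels)
    report = {}
    for group, ref in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        report[group] = {
            "observed": observed,
            "reference": ref,
            "underrepresented": observed < tolerance * ref,
        }
    return report


# Hypothetical annotations for a small image set.
labels = ["group_a"] * 85 + ["group_b"] * 10 + ["group_c"] * 5
# Hypothetical shares the data set is expected to reflect.
reference = {"group_a": 0.4, "group_b": 0.3, "group_c": 0.3}

report = representation_report(labels, reference)
for group, row in report.items():
    print(group, row)
```

In this toy data, `group_b` and `group_c` are flagged because their observed shares are less than half of their reference shares. A real audit would also need to decide what the reference distribution should be, which is a judgment call rather than a purely statistical one.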

Datatang, a leading global AI data service provider, has always paid attention to strengthening its ethical practices. To avoid the risk of algorithmic bias, Datatang has developed a wider range of data sources and designed and produced two data sets: "23,349 People Multi-Race Multi-Pose Face Data" and "26,129 People Multi-Race 7 Expression Recognition Data". The collection balances the distribution of attributes such as race, skin color, age, and gender, and all of the data was collected with the subjects' authorization.

23,349 People Multi-Race Multi-Pose Face Data

[Image: Sample of the multi-race, multi-pose face data, used with the subjects' authorization]

The data covers Asian, Black, white, brown, and Indian people. 29 images were collected for each person: 28 images across multiple lighting conditions, poses, and scenes, plus 1 ID photo.

By collecting faces from racial groups that are currently underrepresented in the AI industry, this data set aims to reduce feature bias in algorithms and improve the accuracy with which users' algorithms describe features.

26,129 People Multi-Race 7 Expression Recognition Data

[Image: Sample of the multi-race 7 expression recognition data, used with the subjects' authorization]

This data was recorded by 17,945 Asian, 3,546 white, 3,727 Black, and 911 brown (Mexican) participants, of whom 13,963 are men and 12,166 are women. Its diversity covers different facial poses, expressions, lighting conditions, and scenes. Measured by the correctness of the expressions, accuracy exceeds 97%, and the accuracy of the expression labels is also above 97%.
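The participant counts given above can be checked and turned into percentages with a few lines of Python. This is only a worked calculation on the figures stated in the article; the variable names and the `shares` helper are introduced here for illustration.

```python
# Participant counts as stated in the article.
race_counts = {
    "Asian": 17_945,
    "white": 3_546,
    "Black": 3_727,
    "brown (Mexican)": 911,
}
gender_counts = {"men": 13_963, "women": 12_166}

total = sum(race_counts.values())

# Both breakdowns should describe the same 26,129 participants.
assert total == 26_129
assert sum(gender_counts.values()) == total


def shares(counts):
    """Return each group's share of the total as a percentage."""
    n = sum(counts.values())
    return {group: 100 * c / n for group, c in counts.items()}


for group, pct in {**shares(race_counts), **shares(gender_counts)}.items():
    print(f"{group}: {pct:.1f}%")
```

Both breakdowns add up to 26,129. On these figures, Asian participants make up roughly 69% of the set, and men roughly 53%.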

At the 2018 Guiyang Big Data Expo, Baidu founder Robin Li proposed four principles of AI ethics. First, the highest principle of AI is to be safe and controllable. Second, AI's vision for innovation is to promote more equal access to technological capabilities. Third, the value of AI lies in teaching people and helping them grow, not in replacing or surpassing them. Finally, the ultimate ideal of AI is to bring more freedom and possibility to humanity.

Datatang has always insisted on strengthening technology ethics and adheres to the principle of technology for good. Datatang has now accumulated rich experience in annotating faces of multiple races, which helps avoid algorithmic bias caused by skewed data sets, so users can use its data with confidence.
