Audiocaptcha

AN IMPROVED AUDIO
Submitted by: Swapnil Singh 0816513057 I.T. - IIIrd year
CONTENTS
What are CAPTCHAs?
Problem with current audio CAPTCHAs.
Testing of current CAPTCHAs.
Categories of audio CAPTCHA.
Algorithm used and its details.
Need for audio reCAPTCHA.
Applications.
Pitfalls.
Conclusion.

WHAT ARE CAPTCHAS?

CAPTCHAs are tests that computers generate and that humans can generally pass but current computer programs cannot.
THE PROBLEM WITH CURRENT AUDIO CAPTCHAS

In some cases the human pass rate is only 70%. To make the CAPTCHAs secure, noise was injected into the audio files, which made them harder for both computers and humans to pass.
ARE CURRENT AUDIO CAPTCHAS SECURE?

A CAPTCHA is considered broken once a program can pass it 5% of the time. Because the current audio CAPTCHAs use a limited vocabulary, we were able to collect enough data to train a system that passes them more than 45% of the time.
HOW DID WE TEST THE CURRENT AUDIO CAPTCHAs?

Selected three different types of audio CAPTCHA: Google, reCAPTCHA, and Digg.
Collected 1000 CAPTCHAs of each type to use for training and testing.
Created an automatic speech recognition (ASR) system using machine learning techniques.
THREE CATEGORIES OF AUDIO CAPTCHA

reCAPTCHA audio CAPTCHA: multiple voices, digits, and background noise that is backwards speech.
Google audio CAPTCHA: digits, single voice, backwards speech.
Digg audio CAPTCHA: digits and letters, static/water for noise.

THE ALGORITHM
Given the .wav file of an audio CAPTCHA:
Segmentation: select the portions of the audio that are most likely digits/letters.
Recognition:
- Extract features from the segment.
- Classify the segment as digit/letter or noise and output the label.
Stop once a maximum number of segments have been classified.
ALGORITHM DETAILS - SEGMENTATION

CAPTCHAs were manually labeled and segmented, and we created the training segments from this information. For testing, we chose the highest energy peaks in the test CAPTCHA and selected fixed-size segments roughly centered at those peaks.
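The test-time peak picking described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the non-overlapping frame layout, and the rule that chosen segments may not overlap are all assumptions made for the example.

```python
def frame_energies(samples, frame_len):
    """Sum of squared amplitudes for each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def pick_segments(samples, frame_len, seg_frames, max_segments):
    """Return (start, end) sample ranges roughly centered on the loudest frames.

    Assumption for this sketch: a frame that overlaps an already chosen
    segment is skipped, so the returned segments never overlap.
    """
    energies = frame_energies(samples, frame_len)
    half = seg_frames // 2
    taken = [False] * len(energies)
    segments = []
    # Visit frames from highest energy to lowest.
    for idx in sorted(range(len(energies)), key=lambda i: energies[i], reverse=True):
        if len(segments) == max_segments:
            break
        lo = max(0, idx - half)
        hi = min(len(energies), lo + seg_frames)
        if any(taken[lo:hi]):
            continue  # overlaps a segment that was already selected
        for j in range(lo, hi):
            taken[j] = True
        segments.append((lo * frame_len, hi * frame_len))
    return sorted(segments)
```

A real system would read the samples from the .wav file and would probably also discard peaks below some energy floor; both are left out here to keep the sketch short.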
ALGORITHM DETAILS - FEATURES

We used three popular techniques for extracting features from speech to derive 5 sets of features from the audio:
- Mel-frequency cepstral coefficients (MFCC)
- Perceptual linear prediction (PLP)
- Relative spectral transform with PLP (RASTA-PLP)
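As a small illustration of one ingredient of MFCC, the sketch below shows the mel-scale frequency warping used to place the filter bank. It uses the common 2595 * log10(1 + f / 700) form of the mel scale; the function names and the choice of band edges are assumptions for the example, and the remaining MFCC steps (framing, FFT, log filter-bank energies, DCT) are not shown.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to mels (common 2595/700 formulation)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(low_hz, high_hz, n_filters):
    """Center frequencies (Hz) of n_filters filters equally spaced in mel."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (n_filters + 1)
    return [mel_to_hz(low + step * (k + 1)) for k in range(n_filters)]
```

Because the spacing is uniform in mel rather than in Hz, the filters are packed more densely at low frequencies, which is where much of the detail in speech lies.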
ALGORITHM DETAILS - AdaBoost

Used decision stumps as weak classifiers.
For each type of audio CAPTCHA, created enough classifiers to label a segment as a digit, letter, or noise.
Created 11 to 37 classifiers.
Each classifier returns a value that represents its confidence that the segment should be labeled as a digit, letter, or noise.
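A minimal sketch of AdaBoost with decision stumps for one binary decision (for example "is this segment the digit 7 or not") might look like the following. It is an illustration under assumptions, not the system described here: the function names, the exhaustive threshold search, and the +1/-1 label encoding are choices made for the example. One plausible reading of "11 to 37 classifiers" is one such binary classifier per label (10 digits plus noise, or 10 digits, 26 letters, and noise), but the source does not say so explicitly.

```python
import math


def stump_predict(stump, row):
    """A stump votes +pol when the chosen feature is >= threshold, else -pol."""
    _, feature, threshold, pol = stump
    return pol if row[feature] >= threshold else -pol


def train_stump(X, y, w):
    """Pick the (error, feature, threshold, polarity) stump with lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                err = sum(
                    wi for row, yi, wi in zip(X, y, w)
                    if (pol if row[f] >= thr else -pol) != yi
                )
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best


def adaboost_train(X, y, rounds):
    """Labels in y must be +1 or -1. Returns a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_score(ensemble, row):
    """Signed confidence: positive favors the +1 class."""
    return sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
```

With one ensemble per label, a segment could be assigned the label whose ensemble returns the highest score, which matches the idea of each classifier returning a confidence value.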
ALGORITHM DETAILS - SVM

Created a single multiclass classifier using all the training segments (from 900 CAPTCHAs).
ALGORITHM DETAILS - k-NN

Created 5 classifiers, one for each of the feature sets.
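A single k-NN classifier over one feature set can be sketched as below. The Euclidean distance, the majority vote, and the function name are assumptions for illustration; the source does not state the distance measure, the value of k, or how the 5 per-feature-set classifiers were combined.

```python
from collections import Counter
import math


def knn_predict(train, query, k):
    """Label a query vector by majority vote among its k nearest neighbors.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

In this setting the feature vectors would come from one of the feature sets (for example MFCC), and the labels would be digits, letters, or noise.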
THE ALGORITHM
Input: audio CAPTCHA as an audio file.
Segmentation:
- Find the highest energy peak and extract a fixed-size segment centered at that peak.
Recognition:
- Extract features from the segment.
- Give the segment to the classifier and obtain a label.
Stop extracting segments once all segments have been labeled or a maximum solution size is reached.
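The control flow of this loop can be sketched as follows. Everything here is hypothetical scaffolding: `solve_captcha`, the `classify` callable, the "noise" label string, and the per-sample energy measure are assumptions for the example, and `toy_classify` merely stands in for feature extraction plus a trained classifier.

```python
def solve_captcha(samples, seg_len, classify, max_solution_size):
    """Greedy loop: take the loudest remaining region, label it, repeat.

    classify is a hypothetical callable that maps a list of samples to a
    label string, or to "noise" when the segment holds no digit/letter.
    """
    remaining = [x * x for x in samples]  # per-sample energy, zeroed once used
    found = []  # (start_index, label) pairs
    while samples and len(found) < max_solution_size:
        peak = max(range(len(remaining)), key=remaining.__getitem__)
        if remaining[peak] == 0:
            break  # nothing left to segment
        # Center the window on the peak, clamped to the ends of the audio.
        start = max(0, min(peak - seg_len // 2, len(samples) - seg_len))
        end = min(len(samples), start + seg_len)
        label = classify(samples[start:end])
        if label != "noise":
            found.append((start, label))
        for i in range(start, end):
            remaining[i] = 0.0  # do not revisit this region
    # Report labels in the order they occur in the audio.
    return "".join(label for _, label in sorted(found))


def toy_classify(segment):
    """Stand-in classifier for the demo: decides by peak amplitude only."""
    loudest = max(abs(x) for x in segment)
    if loudest > 0.8:
        return "3"
    if loudest > 0.4:
        return "8"
    return "noise"


demo = [0.0] * 60
demo[10:15] = [0.9] * 5   # pretend this burst is a spoken "3"
demo[30:35] = [0.5] * 5   # pretend this burst is a spoken "8"
demo[50:53] = [0.2] * 3   # quiet burst treated as noise
answer = solve_captcha(demo, seg_len=9, classify=toy_classify, max_solution_size=4)
```

Segments are visited from loudest to quietest, so the labels are sorted by position at the end to recover the spoken order.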
ANALYSIS OF CURRENT AUDIO CAPTCHAs

Using three machine learning techniques to perform ASR on the CAPTCHAs:
- AdaBoost
- Support Vector Machines (SVM)
- k-Nearest Neighbor (k-NN)
[Figure: bar chart of exact match rate (%) for AdaBoost, SVM, and k-NN on the Google, reCAPTCHA, and Digg audio CAPTCHAs.]
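The exact match rate in the chart can be read as the fraction of CAPTCHAs for which the predicted string equals the full answer. A minimal sketch of that metric, together with the 5% "broken" criterion stated earlier, is given below; the function and constant names are assumptions for illustration.

```python
def exact_match_rate(predictions, answers):
    """Fraction of CAPTCHAs whose predicted string equals the answer exactly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    hits = sum(p == a for p, a in zip(predictions, answers))
    return hits / len(answers)


# From the text: a CAPTCHA counts as broken once a program passes it 5% of the time.
BROKEN_THRESHOLD = 0.05


def is_broken(rate, threshold=BROKEN_THRESHOLD):
    """True when the pass rate reaches the 'broken' threshold."""
    return rate >= threshold
```

A partially correct answer (for example one wrong digit) counts as a miss under this metric, which makes it stricter than per-segment accuracy.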
THE GOAL
Make a secure audio CAPTCHA that is easier for a human to pass and harder for a computer to pass.
Equate solving a CAPTCHA with doing some useful work.
- In other words, create an audio reCAPTCHA.
WHAT IS reCAPTCHA?
reCAPTCHA helps digitize text on which OCR fails by using that text as its CAPTCHA. Since millions of people solve CAPTCHAs each day, millions of words get digitized each day!
THE AUDIO RECAPTCHA

Takes advantage of the human ability to understand words through context.
Will help transcribe digital audio on which ASR systems fail.
The audio being used was originally recorded with the intention that it should be easily understood by humans.
APPLICATIONS
Preventing Comment Spam in Blogs.
Protecting Website Registration.
Protecting Email Addresses From Scrapers.
Online Polls.
Preventing Dictionary Attacks.
Worms and Spam.

ANALYSIS OF SECURITY
Speaker-independent recognition is difficult.
Open vocabularies make it even more difficult for ASR systems.
AM broadcasts and .mp3 compression cause the loss of important data needed for automatic analysis.
CONCLUSION
CAPTCHAs need to be more accessible, yet remain secure and not too difficult for humans.
Deploy audio reCAPTCHA through the reCAPTCHA site.
Help make knowledge captured in audio available in text form.
Thank you
