
Now that we have the voice, we need an STT (Speech to Text) engine, i.e. speech recognition.

Of all the ones I have tried, I prefer Google's. The problem is that to use it we have to save what we say into a WAV file, convert it to FLAC, and send it through the API as if from a browser (quite a hassle).

After gathering bits and pieces from here and there, here it is, summarized:

We record the WAV with the "rec" command (make sure the microphone input volume in alsamixer is at 100%, otherwise it will take it forever to recognize what we are saying):
rec -r 16000 -e signed-integer -b 16 -c 1 audio.wav trim 0 4
We convert from WAV to FLAC:
sox audio.wav -r 16000 -b 16 -c 1 audio.flac vad reverse vad reverse lowpass -2 2500
And now comes the magic... without using the browser, but sending a header that simulates it:
curl --data-binary @audio.flac --header 'Content-type: audio/x-flac; rate=16000' 'http://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=es-ES&maxresults=1' 1>audio.txt
We store what the Google API returns in audio.txt. You can change the language in lang=es-ES (I defaulted to Spain/Spanish), or raise maxresults if you want more hypotheses instead of just 1 (the most likely one).

To boil the reply down to a single phrase, we format the result:
FILETOOBIG=`cat audio.txt | grep "<HTML>"`
TRANSCRIPT=`cat audio.txt | cut -d"," -f3 | sed 's/^.*utterance\":\"\(.*\)\"$/\1/g'`
CONFIDENCE=`cat audio.txt | cut -d"," -f4 | sed 's/^.*confidence\":0.\([0-9][0-9]\).*$/\1/g'`
Confidence (in case we want to use it) is the percentage probability that the recognition is right, and Transcript is what it understood.
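For reference, a successful reply from the API is a single JSON line shaped roughly like this (illustrative values; an oversized upload returns an HTML error page instead, which is presumably what the FILETOOBIG grep detects):

{"status":0,"id":"...","hypotheses":[{"utterance":"hola","confidence":0.92}]}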

With this in place, we try a simple if chain to prove that it works:


if echo "$TRANSCRIPT" | grep -q "Hola"; then
    aoss espeak -ves "$TRANSCRIPT"
elif echo "$TRANSCRIPT" | grep -q "quién eres"; then
    aoss espeak -ves "Soy Yarvis, la máquina de Yuki Sekisan"
elif echo "$TRANSCRIPT" | grep -q "saluda a Alberto"; then
    aoss espeak -ves "Hola Alberto, eres muy pesado, vete ya"
elif echo "$TRANSCRIPT" | grep -q "main craft"; then
    aoss espeak -ves "Abriendo Maincraft"
    java -Xmx1024M -Xms512M -cp /home/yuki/Escritorio/Minecraft.jar net.minecraft.LauncherFrame
else
    aoss espeak -ves "No te entiendo"
    aoss espeak -ves "$TRANSCRIPT"
fi
It works perfectly :). Now the next step... migrating it to a database! I have gone with MySQL.
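The post stops there, but the migration is easy to picture: instead of a hard-coded if/elif chain, look the transcript up in a table. Below is a minimal sketch of the idea, mine rather than the author's; it assumes a hypothetical table commands(phrase, response) and MySQL Connector/J on the classpath, and the connection details are placeholders. Java is used here and for the other sketches in this compilation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CommandLookup {
    public static void main(String[] args) throws Exception {
        String transcript = args[0]; // e.g. the $TRANSCRIPT from the shell script

        // Placeholder credentials; adjust to your own MySQL setup.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/yarvis", "user", "password")) {
            // Match any stored phrase contained in the transcript,
            // mirroring the grep -q tests of the shell version.
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT response FROM commands WHERE ? LIKE CONCAT('%', phrase, '%')");
            ps.setString(1, transcript);
            try (ResultSet rs = ps.executeQuery()) {
                System.out.println(rs.next() ? rs.getString("response") : "No te entiendo");
            }
        }
    }
}

The printed response could then be fed straight to espeak, exactly as the if/elif version does.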

Basically, the library encodes the sample into FLAC using a third-party FLAC library, then issues a request to "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US" with a Content-Type header specifying the format (FLAC) and the bitrate (8 kbit).

http://www.codeproject.com/Articles/338010/Fun-with-Google-Speech-Recognition-service

Introduction
I was excited to discover open web services like Google's, and it was amazing when I heard about Google speech recognition.
In this article, I share some tips for using the Google speech recognition API in a Windows application, recording the voice directly from the audio input devices. And, like a delicious spice, I wrap the simple speech recognition program into a utility for quickly adding issues to a Redmine project.

Background
The basic idea was: you push the button, a timer starts running as the wave-in device opens, the main loop starts and the PCM data from the buffers holding your voice is written to a file; the timer stops, and the audio file is posted to Google for recognition.
The first task was understanding FLAC encoding in real time. You may say: 'In *nix, I can write a couple of commands in the terminal and do it all: record, encode, post the FLAC file, and receive the answer from the server. So why not encode the file with an encoder program started after recording the wave file?' Because it's boring. Just imagine: your program writes an already prepared FLAC audio file!
From the time when I wrote an application for batch converting MP3 files to OGG/Vorbis, I still had a library that can encode PCM to Vorbis in real time; it also had a ring buffer for that.
The corresponding handler for FLAC was not long in coming. You may know that Google accepts FLAC in 16 kHz, 16 bits per sample, 1 (mono) channel format. Using the example in libflac, I added three functions: InitialiseEncoder, ProcessEncoder, CloseEncoder, which, respectively, open the file and prepare the encoder, feed 16-bit PCM samples to the encoder, and close the file and destroy the encoder. One thing I don't understand: why can't it add metadata to the FLAC file? Maybe charset problems?

The wonderful article WaveLib includes a wave-in API implementation and a Recorder class: it starts the WaveInRecorder and, in parallel, uses a thread to transmit the PCM data to the encoder.

File Uploading
The basic usage of the upload function is below; change the lang parameter as needed:

string result = WebUpload.UploadFileEx(flacpath,
    "http://www.google.com/speech-api/v1/recognize?lang=ru&client=chromium",
    "file", "audio/x-flac; rate=16000", parameters, null);

The response from the server is received in JSON format.

Issue Creating
In what scenario can you use speech recognition? Maybe for creating issues? It may not be practical, but it is certainly fun.
The Redmine web application includes a REST web service. Through it, we can create as many issues as we need; just specify the project and the tracker (by the way, I could only fetch the list of trackers starting with version 1.3.*).

RedmineManager manager = new RedmineManager(Configuration.RedmineHost,
    Configuration.RedmineUser, Configuration.RedminePassword);
// New issue
var newIssue = new Issue
{
    Subject = Title,
    Description = Description,
    Project = new IdentifiableName() { Id = ProjectId },
    Tracker = new IdentifiableName() { Id = TrackerId }
};
// Get the ID of the current user
User thisuser = (from u in manager.GetObjectList<User>(
                     new System.Collections.Specialized.NameValueCollection())
                 where u.Login == Configuration.RedmineUser
                 select u).FirstOrDefault();
if (thisuser != null)
    newIssue.AssignedTo = new IdentifiableName() { Id = thisuser.Id };
manager.CreateObject(newIssue);

Points of Interest
When it was done, I turned my attention to the recording timeout: it gives you 4 seconds for your speech, which may not be enough for every utterance. Maybe the form needs a stop button?

A ring buffer will save you from data loss when recording directly to FLAC like this: when the data comes in from the wave-in device, it goes into the ring buffer, as sketched below.
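The article does not list the ring buffer itself; purely as an illustration of the idea (in Java rather than the article's C#), here is a fixed-capacity ring buffer that a wave-in callback could write into and an encoder thread could drain:

class RingBuffer {
    private final byte[] buf;
    private int head = 0, tail = 0, count = 0;

    RingBuffer(int capacity) { buf = new byte[capacity]; }

    // Called by the capture side; returns how many bytes actually fit.
    synchronized int write(byte[] src, int off, int len) {
        int n = Math.min(len, buf.length - count);
        for (int i = 0; i < n; i++) {
            buf[head] = src[off + i];
            head = (head + 1) % buf.length;
        }
        count += n;
        return n;
    }

    // Called by the encoder side; returns how many bytes were available.
    synchronized int read(byte[] dst, int off, int len) {
        int n = Math.min(len, count);
        for (int i = 0; i < n; i++) {
            dst[off + i] = buf[tail];
            tail = (tail + 1) % buf.length;
        }
        count -= n;
        return n;
    }

    synchronized int available() { return count; }
}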


History

February 28, 2012: First version

License

Introduction
As I already mentioned in my article A low-level audio player in C#, there are no
built-in classes in the .NET framework for dealing with sound. This holds true not
only for audio playback, but also for audio capture.
It should be noted, though, that the Managed DirectX 9 SDK does include classes for high-level and low-level audio manipulation. However, sometimes you don't want your application to depend on the full DX 9 runtime just to do basic sound playback and capture, and there are also some areas where Managed DirectSound doesn't help at all (for example, multi-channel sound playback and capture).
Nevertheless, I strongly recommend using Managed DirectSound for sound playback and capture unless you have a good reason for not doing so.
This article describes a sample application that uses the waveIn and waveOut APIs in C# through P/Invoke to capture an audio signal from the sound card's input, and play it back (almost) at the same time.

Using the code


The sample code reuses the WaveOutPlayer class from my article A low-level audio player in C#. The new classes in this sample are WaveInRecorder and FifoStream.
The FifoStream class extends System.IO.Stream to implement a FIFO (first-in first-out) of bytes. The overridden Write method adds data to the FIFO's tail, and the Read method peeks and removes data from the FIFO's head. The Length property returns the amount of buffered data at any time. Calling Flush will clear all pending data.
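The complete FifoStream ships with the article's source download; as an unofficial sketch of the same contract (an unbounded FIFO of bytes: writes append at the tail, reads consume from the head, plus a length query and a flush that discards everything), here in Java:

import java.util.ArrayDeque;

class ByteFifo {
    private final ArrayDeque<byte[]> chunks = new ArrayDeque<>();
    private int offset = 0;  // read position inside the head chunk
    private int length = 0;  // total buffered bytes

    // Appends a copy of the data at the FIFO's tail.
    synchronized void write(byte[] buf, int off, int count) {
        byte[] copy = new byte[count];
        System.arraycopy(buf, off, copy, 0, count);
        chunks.addLast(copy);
        length += count;
    }

    // Removes up to count bytes from the FIFO's head; returns bytes read.
    synchronized int read(byte[] buf, int off, int count) {
        int done = 0;
        while (done < count && !chunks.isEmpty()) {
            byte[] head = chunks.peekFirst();
            int n = Math.min(count - done, head.length - offset);
            System.arraycopy(head, offset, buf, off + done, n);
            offset += n;
            done += n;
            if (offset == head.length) { chunks.removeFirst(); offset = 0; }
        }
        length -= done;
        return done;
    }

    synchronized int length() { return length; }

    synchronized void flush() { chunks.clear(); offset = 0; length = 0; }
}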
The WaveInRecorder class is analogous to the WaveOutPlayer class. In fact, if you look at the source files, you'll notice that the implementations of these classes are very similar. As with WaveOutPlayer, the interface of this class has been reduced to the strict minimum.
Creating an instance of WaveInRecorder will cause the system to start recording immediately. Here's the code that creates the WaveOutPlayer and WaveInRecorder instances.

private void Start()
{
    Stop();
    try
    {
        WaveLib.WaveFormat fmt = new WaveLib.WaveFormat(44100, 16, 2);
        m_Player = new WaveLib.WaveOutPlayer(-1, fmt, 16384, 3,
            new WaveLib.BufferFillEventHandler(Filler));
        m_Recorder = new WaveLib.WaveInRecorder(-1, fmt, 16384, 3,
            new WaveLib.BufferDoneEventHandler(DataArrived));
    }
    catch
    {
        Stop();
        throw;
    }
}

The WaveInRecorder constructor takes five parameters. Except for the last parameter, their meaning is the same as in WaveOutPlayer.
The first parameter is the ID of the wave input device that you want to use. The value -1 represents the default system device, but if your system has more than one sound card, you can pass any number from 0 to the number of installed sound cards minus one to select a particular device.
The second parameter is the format of the audio samples.
The third and fourth parameters are the size of the internal wave buffers and the number of buffers to allocate. You should set these to reasonable values. Smaller buffers will give you less latency, but the captured audio may have gaps in it if your computer is not fast enough.
The fifth and last parameter is a delegate that will be called periodically as the internal audio buffers fill with captured data. In the sample application we just write the captured data to the FIFO, like this:

private void DataArrived(IntPtr data, int size)
{
    if (m_RecBuffer == null || m_RecBuffer.Length < size)
        m_RecBuffer = new byte[size];
    System.Runtime.InteropServices.Marshal.Copy(data, m_RecBuffer, 0, size);
    m_Fifo.Write(m_RecBuffer, 0, size); // write only the bytes just captured
}

Similarly, the Filler method is called every time the player needs more data. Our
implementation just reads the data from the FIFO, as shown below:

private void Filler(IntPtr data, int size)
{
    if (m_PlayBuffer == null || m_PlayBuffer.Length < size)
        m_PlayBuffer = new byte[size];
    if (m_Fifo.Length >= size)
        m_Fifo.Read(m_PlayBuffer, 0, size);
    else
        for (int i = 0; i < m_PlayBuffer.Length; i++)
            m_PlayBuffer[i] = 0;
    System.Runtime.InteropServices.Marshal.Copy(m_PlayBuffer, 0, data, size);
}

Note that we declared the temporary buffers m_RecBuffer and m_PlayBuffer as member fields in order to improve performance by saving some garbage collections.
To stop streaming, just call Dispose on the player and capture objects. We also need to flush the FIFO so that the next time Start is called there is no residual data to play.


private void Stop()
{
    if (m_Player != null)
        try
        {
            m_Player.Dispose();
        }
        finally
        {
            m_Player = null;
        }
    if (m_Recorder != null)
        try
        {
            m_Recorder.Dispose();
        }
        finally
        {
            m_Recorder = null;
        }
    m_Fifo.Flush(); // clear all pending data
}

Conclusion

curl -H "Content-Type: audio/x-flac; rate=16000" "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US" -F myfile="@C:\input.flac" -k -o "C:\output.txt"
It works excellently! Just a few notes:
1) when copying and pasting: watch out for different quote characters
2) make sure rate=16000 matches the bitrate of the recording (set it in Audacity before recording)!
3) I got somewhat better results recording in mono
Does anyone know anything about:
* Done waiting for 100-continue
I keep losing these milliseconds waiting for it...
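One common workaround for that (my note, not from the thread): curl pauses for the server's 100-continue response on large POSTs, and sending an empty Expect header disables the handshake:

curl -H "Expect:" -H "Content-Type: audio/x-flac; rate=16000" --data-binary "@input.flac" "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US"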


Path path = Paths.get("out.flac");
byte[] data = Files.readAllBytes(path);

String request = "https://www.google.com/"
        + "speech-api/v1/recognize?"
        + "xjerr=1&client=speech2text&lang=en-US&maxresults=10";
URL url = new URL(request);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setDoOutput(true);
connection.setDoInput(true);
connection.setInstanceFollowRedirects(false);
connection.setRequestMethod("POST");
connection.setRequestProperty("Content-Type", "audio/x-flac; rate=16000");
connection.setRequestProperty("User-Agent", "speech2text");
connection.setConnectTimeout(60000);
connection.setUseCaches(false);

DataOutputStream wr = new DataOutputStream(connection.getOutputStream());
wr.writeBytes(new String(data));
wr.flush();
wr.close();

System.out.println("Done");

BufferedReader in = new BufferedReader(
        new InputStreamReader(connection.getInputStream()));
String decodedString;
while ((decodedString = in.readLine()) != null) {
    System.out.println(decodedString);
}
connection.disconnect(); // disconnect only after the response has been read

You should use wr.write(data); instead of wr.writeBytes(new String(data));
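The reason, illustrated with my own sketch rather than anything from the thread: pushing raw bytes through a String runs them through a charset decode, and writeBytes then keeps only the low byte of each char, so arbitrary binary data such as FLAC gets silently corrupted:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BytesVsString {
    public static void main(String[] args) {
        // Bytes that are not valid UTF-8, as audio data routinely is.
        byte[] data = { (byte) 0xC3, (byte) 0x28, (byte) 0xFF, 0x00 };

        // What writeBytes(new String(data)) effectively sends:
        // decode to chars, then truncate each char to its low 8 bits.
        String s = new String(data, StandardCharsets.UTF_8);
        byte[] sent = new byte[s.length()];
        for (int i = 0; i < s.length(); i++)
            sent[i] = (byte) s.charAt(i); // low byte only, like writeBytes

        System.out.println(Arrays.toString(data)); // [-61, 40, -1, 0]
        System.out.println(Arrays.toString(sent)); // differs: invalid bytes became U+FFFD
    }
}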


Google's response:
{"status":0,"id":"e0f4ced346ad18bbb81756ed4d639164-1","hypotheses":[{"utterance":"hello how are you","confidence":0.94028234},{"utterance":"hello how r u"},{"utterance":"how are you today u"},{"utterance":"hello how are you in"}]}
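To use that reply programmatically, one option (my sketch; it assumes the org.json library on the classpath) is to pull out the top-ranked hypothesis:

import org.json.JSONArray;
import org.json.JSONObject;

public class ParseResponse {
    public static void main(String[] args) {
        String body = "{\"status\":0,\"id\":\"x\",\"hypotheses\":"
                + "[{\"utterance\":\"hello how are you\",\"confidence\":0.94028234}]}";

        JSONObject response = new JSONObject(body);
        JSONArray hypotheses = response.getJSONArray("hypotheses");
        if (hypotheses.length() > 0) {
            JSONObject best = hypotheses.getJSONObject(0); // first = most confident
            System.out.println(best.getString("utterance")
                    + " (confidence " + best.optDouble("confidence", 0.0) + ")");
        }
    }
}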


package test;

import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TestGoogleApiForSpeechRecognition {

    public static void main(String[] args) throws Exception {

        Path path = Paths.get("C:\\Users\\CDAC\\Downloads\\priyanka.flac");
        byte[] data = Files.readAllBytes(path);

        String request = "https://www.google.com/"
                + "speech-api/v1/recognize?"
                + "xjerr=0&client=speech2text&lang=en-US&maxresults=20";
        URL url = new URL(request);

        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setDoOutput(true);
        connection.setDoInput(true);
        connection.setInstanceFollowRedirects(false);
        connection.setRequestMethod("POST");
        connection.setRequestProperty("Content-Type", "audio/x-flac; rate=16000");
        connection.setRequestProperty("User-Agent", "speech2text");
        connection.setConnectTimeout(60000);
        connection.setUseCaches(false);

        DataOutputStream wr = new DataOutputStream(connection.getOutputStream());
        wr.write(data); // raw bytes, not writeBytes(new String(data))
        wr.flush();
        wr.close();

        System.out.println("Done");

        BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream()));
        String decodedString;
        while ((decodedString = in.readLine()) != null) {
            System.out.println(decodedString);
        }
        connection.disconnect(); // disconnect after the response has been read
    }
}
