INTERSPEECH—the annual conference of the International Speech Communication Association (ISCA)—is one of my favorite conferences to attend. I haven't attended that many, because the timing hasn't always suited my schedule, and I have really only been in the field for the last decade or so. But it's always a great place to go and find out what's new in the area of speech technology and closely related areas.
This year's conference was to be held in Shanghai, but, as you might imagine, it went entirely online. So I got to attend my third major online conference since the coronavirus hit. And the organizers did not disappoint. They ran it via Zoom and apparently contracted a fancy webinar service, which meant that as participants we could not share our video or screens at all, and could share audio only if nominated to do so. But one useful feature was a Q&A tool (not the same as the usual Zoom chat tool) where one could ask questions during presentations. In fact, you could ask questions at any time throughout an entire session, which consisted of anywhere from 8 to 15 speakers. Unfortunately, because each session was only one hour long, there really wasn't much time for each speaker to respond to questions. In some sessions I "attended", the moderators seemed to rush things along, sometimes stopping the Q&A after only one question just to stay on schedule. I felt rather sorry for those who had signed up to give posters: the experience they actually got during the conference was probably nothing like what they would normally have gotten at a face-to-face poster session.
Yet, on the other hand, I understand that attendance at INTERSPEECH was the largest ever (because of online participation), which means that those poster presenters may have gotten more exposure than they ever would have at a face-to-face conference. But it may be a while before they realize the benefit of that exposure. Certainly the conference experience for them was vastly different from a normal poster session.
Anyway, I enjoyed several good presentations, some of them focusing explicitly on filled pauses. Here are links to a few. [Clicking on the "link" will take you to a page where you can read the abstract, watch a video presentation, and download the paper.]
- Correlating cepstra with formant frequencies: implications for phonetically-informed forensic voice comparison (link)
- Correlation between prosody and pragmatics: case study of discourse markers in French and English (link)
- The phonetic bases of vocal expressed emotion: natural versus acted (link)
- Do face masks introduce bias in speech technologies? The case of automated scoring of speaking proficiency (link)
I also especially enjoyed one of the pre-conference tutorials, on Spoken Dialog for Social Robots (link), given by Tatsuya Kawahara and Kristiina Jokinen. In great and careful detail, they went through many of the issues and factors to consider in the design and generation of social robot speech. Prof. Kawahara pointed out that some of the widely known examples of talking social robots (e.g., "Pepper", here in Japan) are not terribly successful from the point of view of speech communication. That is, people end up not interacting with many of these via speech (reverting to touch-screens, for example), or using, at best, very limited, domain-specific speech. So he made a strong case for considering carefully what the ideal domain would be for enabling social robot speech communication. One good scenario seemed to be conversational partners for the elderly, which is certainly one area that researchers in Japan are pushing.
Prof. Jokinen talked about a variety of topics, but one that really struck me was the question of ethics. Since there wasn't much time for questions at the end, I had an e-mail exchange with both of them afterward on the question of whether robot speech should include disfluencies. This has been on my mind a lot recently. The basic question is whether it's ethical for a robot to insert disfluencies into its speech, even though they might not be "authentic" disfluencies. Prof. Kawahara pointed out that we need a clearer definition of what counts as a disfluency, since one could make the case that some so-called disfluencies in natural human speech are used facilitatively by speakers. This is a point that I sympathize with. But we also agreed that casually inserting disfluencies merely to make robot speech sound more natural is potentially deceptive. And yet, I would probably still argue that as long as human users are fully aware that they are interacting with an artificial agent, then, if the disfluencies actually have a positive influence on the interaction, it could be justified.
Anyway, as I say, INTERSPEECH this year was quite an event. It was the first time I could participate fully in a conference without either leaving home or canceling any of my normal work schedule (in Japan time, the entire conference ran roughly 6 pm to midnight each night). There's a part of me that thinks, "I could get used to this!"
But no. Online conferences still do not capture much of the spontaneity and serendipitous interactions that occur at a physical conference. Perhaps one day someone will figure out how to replicate that, but not yet.
So, I am hoping next year's INTERSPEECH will be back on-site.