Keywords: Horn detection; Machine learning; Scenario-driven; Social inclusion

Source: DOI:10.1016/j.dib.2024.110678

Abstract:
In recent years, there has been significant growth in the development of Machine Learning (ML) models across fields such as image and sound recognition and natural language processing. Such models must be trained on a sufficiently large dataset to make their predictions as accurate as possible. For audio-recognition models, and specifically for the detection of car horns, existing datasets are generally not built with the specificities of the different scenarios found in real traffic in mind: they are limited to collections of random horn sounds, sometimes sourced from audio streaming sites. An ML model trained on data tailored for horn detection offers clear benefits. One notable advantage is the potential to embed horn detection in smartphones and smartwatches to aid hearing-impaired individuals while driving, alerting them in potentially hazardous situations and thus promoting social inclusion. Given these considerations, we developed a dataset specifically for car horns. The dataset contains 1,080 one-second .wav audio files categorized into two classes: horn and not horn. Data collection followed a carefully established protocol designed to cover different scenarios in a real traffic environment, considering diverse relative positions between the vehicles involved. The protocol defines ten distinct scenarios, incorporating variables within the car receiving the horn (the presence of internal conversation, music, open or closed windows, engine status on or off, and whether the car is stationary or in motion) as well as variations associated with the vehicle emitting the horn, such as its relative position (behind, alongside, or in front of the receiving vehicle) and the type of horn used: a short honk, a prolonged one, or a rhythmic pattern of three quick honks.
Data collection began with simultaneous audio recordings on two smartphones positioned inside the receiving vehicle, capturing all scenarios in a single audio file on each device. A 400-meter route was defined in a controlled area so that the recordings could be carried out safely. For each scenario, the route was driven while the different types of horns were emitted from the distinct relative positions between the vehicles, and the route was then restarted for the next scenario. After the collection phase, preprocessing involved manually cutting each horn sound into multiple one-second windowing profiles and saving them as 16-bit PCM stereo .wav files with a 44.1 kHz sampling rate. For each horn clip, a corresponding non-horn clip was cut from nearby audio, ensuring a balanced dataset. The dataset is designed for use with various machine learning algorithms, whether for detecting horns with the binary labels or for classifying different horn patterns by rearranging labels according to the file nomenclature. For technical validation, classifications were performed using a convolutional neural network trained on spectrograms computed from the dataset's audio, achieving an average accuracy of 89% across 100 trained models.
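The pipeline described above (one-second, 16-bit PCM clips at 44.1 kHz converted to spectrograms for a CNN) can be sketched as follows. This is a minimal illustration, not the authors' code: the spectrogram window parameters and the file name are assumptions, since the abstract does not specify them, and a synthetic tone stands in for a real dataset clip.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

SAMPLE_RATE = 44100  # dataset clips are one second long at 44.1 kHz

def clip_to_spectrogram(path):
    """Load a one-second .wav clip and return a log-magnitude spectrogram
    suitable as CNN input. Window sizes below are illustrative assumptions."""
    rate, samples = wavfile.read(path)
    if samples.ndim == 2:                 # stereo clip: average channels to mono
        samples = samples.mean(axis=1)
    samples = samples.astype(np.float32) / 32768.0  # 16-bit PCM -> [-1, 1)
    freqs, times, spec = spectrogram(samples, fs=rate,
                                     nperseg=512, noverlap=256)
    return np.log(spec + 1e-10)           # log scale stabilizes CNN training

# Demo with a synthetic 440 Hz tone standing in for a horn clip
# (the real dataset files are not bundled here).
t = np.linspace(0, 1, SAMPLE_RATE, endpoint=False)
tone = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
wavfile.write("demo_clip.wav", SAMPLE_RATE, tone)
spec = clip_to_spectrogram("demo_clip.wav")
print(spec.shape)  # (frequency bins, time frames)
```

Each clip thus becomes a fixed-size 2-D array, which is what allows a convolutional network designed for images to be applied directly to the audio.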