XIV International Conference 2020 SPbGASU "Organization and safety of traffic in large cities"

Method to estimate pedestrian traffic using convolutional neural network

Georgii Kataev, Vitalii Varkentin, Kseniia Nikolskaia *

South Ural State University, 76 Lenina Prosp., Chelyabinsk, 454080, Russia
This study describes a neural network approach to collecting pedestrian traffic statistics from street surveillance cameras. Collecting
and processing pedestrian traffic is one of the most important areas in the development of smart cities. To solve the problem of
collecting pedestrian traffic statistics, a modern system of object detection in real time, YOLOv3, was used. To train the neural
network, a data set of 750 labeled frames with pedestrians was used, which amounted to 20,000 objects. According to the results
of the system testing, the recognition accuracy was 79%. The presented data set can be used by other researchers in their studies.
Keywords: system analysis; efficiency; stochastic approach; traffic safety.

© 2020 The Authors. Published by ELSEVIER B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the XIV International Conference 2020 SPbGASU
"Organization and safety of traffic in large cities"
* E-mail address: nikolskaya174@gmail.com

2352-1465 © 2020 Georgii Kataev, Vitalii Varkentin, Kseniia Nikolskaia. Published by ELSEVIER B.V.
Georgii Kataev et al. / Transportation Research Procedia 50 (2020) 234–241

1. Introduction

Ensuring traffic and environmental safety is an important task that requires direct attention at all levels of
governance. Many studies are devoted to improving traffic safety and vehicle performance, and reducing air emissions
(Brylev et al. 2018, Danilov et al. 2018, 2020, Evtiukov et al. 2018a, 2018b, Ginzburg et al. 2017, Kerimov et al.
2017, Kurakina et al. 2018, Marusin 2017a, 2017b, Marusin and Abliazov 2019, Marusin et al. 2018, 2019, 2020,
Repin et al. 2018, Safiullin et al. 2018, 2019, Soo et al. 2020, Vorozheikin et al. 2019). However, there is a
traditional method of pedestrian movement that is considered to a lesser extent: walking. Walking is the main and
traditional way of moving in a city. For the sustainable development of smart cities, pedestrians should be
considered as one of the most important components. To create an environmentally friendly, safe, convenient, and
comfortable transport system, it is necessary to know the pedestrian flows (De Luca and Gallo 2020). If road services
have such important parameters as the number of pedestrians at a certain time of the day who are waiting for
transport, then it will be possible to determine the necessity to increase the availability of passenger
transport in problem areas of the city (Goryaev et al. 2018). Installing CCTV cameras at every stop is a very expensive
solution. Therefore, it was decided to develop an application for calculating pedestrian traffic based on the existing
infrastructure. At the moment, Chelyabinsk has 37 CCTV cameras that cover stops at the busiest intersections of the
city. To develop the application, we used a data set from one video surveillance camera.

2. Related studies

Lin et al. (2017) describe an algorithm for analyzing traffic video captured by a UAV (unmanned aerial vehicle).
The algorithm consists of four parts: the first two parts deal with vehicle detection, and the last two parts estimate
traffic flow parameters. The first part of the algorithm is a Haar cascade classifier trained using randomly generated
Haar-like features, performing primary detection of regions with vehicles. The second part of the algorithm is a
convolutional neural network developed by the authors. Using the regions selected in the previous stage, it determines
the exact location of vehicles. The network training data set consists of 20,000 image samples. It is available for public
use (Manzoor et al. 2019). The authors managed to achieve 99.55% accuracy on a test set over 100 training epochs.
The third part of the algorithm is vehicle tracking based on the KLT (Kanade–Lucas–Tomasi) method. Finally, the
fourth part of the algorithm is the estimation of traffic flow parameters.
Fedorov et al. (2019) address the problem of traffic flow estimation based on the data from video surveillance
cameras. They propose a system based on the Faster R-CNN two-stage detector, whose performance is enhanced with
several modifications: focal loss, adaptive feature pooling, an additional mask branch, and anchor optimization. The
system also includes the SORT (Simple Online and Realtime Tracking) tracker, which handles multiple object
tracking. The system is able to operate with a maximum relative error of less than 10%. The data set for neural
network training included 982 frames.
Wei et al. (2019) developed a traffic tracking system for CCTV cameras with a low frame rate (0.3–1 Hz). A
pre-trained SSD-Mobilenet network was used for recognition. The network was trained on the CityCam data set and a
custom labeled data set comprising 2000 images. The mean absolute error of the system is 8 times lower than that
of the optical flow algorithm.
Asha and Narasimhadhan (2018) propose a traffic management system that captures data from hand-held video
cameras. Hand-held video cameras were chosen by the authors to create various types of interference: camera
“shaking”, a more complex environment, a lot of shadows, etc. The system operates in three stages: recognition and
classification by the YOLO neural network, multiple object tracking with the use of a correlation filter, and vehicle
counting based on the tracked trajectory. The YOLO neural network was trained on the PASCAL VOC data set
(Varkentin et al. 2019a).
Cao et al. (2019) describe an intelligent transportation system based on the modified YOLO neural network. The
authors needed increased accuracy of detection under different weather conditions and at different times of day. One
of the main improvements was a modification of the formula for IoU (Intersection over Union) determination
(Varkentin et al. 2019b). Thus, bounding boxes were generated 10% more accurately than with the standard YOLO.
The authors trained the network on the VOC2007 data set with normalization of pixel values in images to the range
[-1, 1]. For testing, they used their own UA-CAR data set based on 26,000 images taken from the UA-DETRAC data
set. With its use, the developed neural network showed a 10–20% increase in its accuracy.
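The pixel normalization mentioned above can be sketched in a few lines. This is only an illustration under the assumption that the source images store 8-bit intensities (0 to 255); the function name is hypothetical and the exact preprocessing of Cao et al. (2019) is not reproduced here.

```python
def normalize_pixels(values):
    """Map 8-bit intensities 0..255 linearly onto the range [-1, 1].

    Assumption: inputs are 8-bit values; 127.5 is half of the 0..255 span.
    """
    return [v / 127.5 - 1.0 for v in values]


# Black maps to -1.0 and white maps to 1.0.
normalized = normalize_pixels([0, 255])
```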
Song et al. (2019) describe a vehicle detection and counting system, the main element of which is the YOLOv3
neural network. The algorithm of the system is quite simple: first, the received roadway video data is segmented; then,
the YOLOv3 neural network is used to detect vehicles; and then, features are extracted by the ORB algorithm to track
the movement of vehicles. The neural network was trained on a set of data collected from many other data sets. It
includes 11,000 images from surveillance cameras, dashboard cameras, and cameras that are not intended for

3. Methodology and implementation

A set of data for training and testing the neural network was provided by Intersvyaz company. Image labeling was
performed in the COCO Annotator web tool (Varkentin et al. 2019c), which provides a user-friendly interface and a
wide range of functions. The COCO format for the data set was chosen because the labeling is saved in a .json file,
which is supported by all modern programming languages. The entire data set contains 750 images.
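As a rough illustration of why the JSON-based COCO format is convenient, the sketch below groups bounding boxes by image using only the Python standard library. The keys images, annotations, categories, image_id, category_id, and bbox (stored as [x, y, width, height]) follow the public COCO layout; the file names, the class name "pedestrian", and the function name are made up for the example and are not taken from the authors' data set.

```python
import json
from collections import defaultdict


def boxes_per_image(coco):
    """Group COCO [x, y, width, height] boxes by image file name."""
    names = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[names[ann["image_id"]]].append((ann["category_id"], ann["bbox"]))
    return dict(grouped)


# Tiny hand-made sample standing in for a real annotation file (hypothetical values).
sample = json.loads("""
{
  "images": [{"id": 1, "file_name": "frame_0001.jpg"},
             {"id": 2, "file_name": "frame_0002.jpg"}],
  "categories": [{"id": 1, "name": "pedestrian"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [100, 50, 40, 90]},
    {"id": 11, "image_id": 1, "category_id": 1, "bbox": [300, 60, 35, 80]},
    {"id": 12, "image_id": 2, "category_id": 1, "bbox": [20, 40, 30, 70]}
  ]
}
""")

labels = boxes_per_image(sample)
total_objects = sum(len(boxes) for boxes in labels.values())
```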
During the design of the application architecture, a component diagram was developed that breaks down the
software system into structural components and relationships between them. The diagram is shown in Fig. 1.

Fig. 1. Component diagram.

The presented component diagram consists of the following artifacts:

1. yolo_video.py: the software module responsible for launching the application.
2. yolo.py: the software module containing the main recognition and classification algorithm. It also contains an
algorithm for drawing bounding boxes in the image and an algorithm for saving the results.
3. train.py: the software module responsible for the training of the neural network.
4. TensorFlow: the framework containing the software implementation of the calculations necessary for the neural
network.
5. Keras: the framework containing the software implementation of the neural network model.

The developed application is based on a convolutional neural network of the YOLOv3 system. Its basic architecture
consists of three main parts: a feature extraction algorithm (backbone), a detector, and a classifier.
Backbone: the feature extraction algorithm, named DarkNet-53 by its authors. It consists of 53 convolutional
layers, each of which includes a normalization layer. The activation function for each layer is Leaky ReLU. Between
certain layers, the spatial dimensions of the feature map are halved; in total, the backbone reduces them by a factor
of 32. For YOLO to proceed, the backbone must output three feature maps whose dimensions are reduced relative to
the original image by factors of 8, 16, and 32, respectively. The algorithm operation scheme is shown in
Fig. 2.

Fig. 2. Algorithm operation scheme.
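The downsampling arithmetic of the backbone can be checked with a small sketch. The 416 × 416 input resolution is an assumption chosen for illustration (the input size used by the authors is not stated), and the function name is hypothetical.

```python
def feature_map_sizes(width, height, strides=(8, 16, 32)):
    """Spatial size of each backbone output for the given reduction factors."""
    return [(width // s, height // s) for s in strides]


# Five successive halvings give the overall reduction factor of 32.
total_reduction = 2 ** 5

# Assumed input resolution; yields the three map sizes passed on to the detector.
sizes = feature_map_sizes(416, 416)
```

With this assumed input, the three feature maps measure 52 × 52, 26 × 26, and 13 × 13 cells.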

Detector: a convolutional neural network consisting of 200 layers. Layers with convolutional kernels of
dimensions 1 × 1 and 3 × 3 alternate. The last layer must have a 1 × 1 convolutional kernel. The feature map of the
smallest dimension is processed by this network directly. Each of the remaining two maps is first concatenated with
the lower-dimension map that the network has already processed and only then sent to the neural network. This
process is shown in Fig. 3.

Fig. 3. Processing of feature maps by the detector.

Classifier: a fully connected neural network consisting of three layers with 2048, 2048, and 13 neurons,
respectively. The number of neurons in the last layer is equal to the number of classes among which the prediction
is made.
The yolo_video.py script was implemented to run recognition and classification on images; its input parameters are
shown in Fig. 4.

Fig. 4. Example of running a script for recognition and classification.

Description of yolo_video.py script parameters:

1) model — the path to the file with weights of the used neural network;
2) anchors — the path to the file with anchors of the used neural network;
3) classes — the path to the file with the classes that need to be recognized;
4) gpu_num — the number of GPUs that will be used during application operation;
5) image — the boolean flag that enables image recognition mode;
6) input — the path to the folder with the initial data to work with;
7) output — the path to the folder where the results are to be saved.

The script outputs the input images with the recognized objects marked.
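The parameter list above could be exposed on the command line roughly as follows. This is a hedged reconstruction with the standard argparse module, not the authors' code: the flag spellings, types, and defaults are assumptions, and the file names in the example call are hypothetical.

```python
import argparse


def build_parser():
    """Command-line interface mirroring the seven listed parameters (assumed flags)."""
    parser = argparse.ArgumentParser(description="Sketch of a yolo_video.py command line")
    parser.add_argument("--model", type=str, help="path to the file with network weights")
    parser.add_argument("--anchors", type=str, help="path to the file with anchors")
    parser.add_argument("--classes", type=str, help="path to the file with class names")
    parser.add_argument("--gpu_num", type=int, default=1, help="number of GPUs to use")
    parser.add_argument("--image", action="store_true", help="enable image recognition mode")
    parser.add_argument("--input", type=str, help="folder with the initial data")
    parser.add_argument("--output", type=str, help="folder where results are saved")
    return parser


# Hypothetical invocation; none of these paths come from the paper.
args = build_parser().parse_args([
    "--model", "weights.h5",
    "--anchors", "anchors.txt",
    "--classes", "classes.txt",
    "--image",
    "--input", "frames/",
    "--output", "results/",
])
```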

YOLO has weights pre-trained on the MS COCO data set. The neural network with these weights did not meet the
quality requirements; therefore, additional training on our own data set was required.
Neural network training was performed on a computer with the following characteristics:

1) graphics accelerator NVIDIA GEFORCE GTX 1650 SUPER (1725 MHz, 4 GB GDDR6, 12 Gbps, 1280 CUDA cores);
2) CPU AMD Ryzen 5 2600X (6 cores, 12 threads, 3.6 GHz);
3) 16 GB RAM;
4) OS Linux Mint 19.3 Tricia.

The entire data set contained 750 images. The training sample comprised 80% of the data set, that is, 600 images.
Training ran for 350 epochs and took 6 hours.
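The 80/20 division of the 750 images can be sketched as follows. Whether the authors shuffled the frames before splitting is not stated, so the seeded shuffle, the function name, and the frame file names are assumptions made for this example.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle reproducibly, then cut the list into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical frame names standing in for the 750 labeled images.
frames = [f"frame_{i:04d}.jpg" for i in range(750)]
train_frames, test_frames = split_dataset(frames)
```

A fixed seed keeps the two samples identical between runs, which makes test results comparable across training sessions.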
The train.py script was implemented to train the neural network. For training, the script requires a train.txt file
containing the labeling information. The string template of this file is shown in Fig. 5, where x1, y1 are the
coordinates of the lower-left corner of the labeled object, x2, y2 are the coordinates of the upper-right corner of
the labeled object, and class is the integer representation of the class name. The contents of a train.txt file are
shown in Fig. 6.

Fig. 5. String template in train.txt file.

Fig. 6. Contents of train.txt file.
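Since the figures are not reproduced here, the sketch below assumes one plausible layout for a train.txt line: an image path followed by space-separated boxes, each written as x1,y1,x2,y2,class. This layout, the function names, and the sample values are assumptions for illustration and may differ from the template in Fig. 5.

```python
def format_line(image_path, boxes):
    """Build one annotation line: path, then 'x1,y1,x2,y2,class' per object."""
    parts = [image_path] + [",".join(str(int(v)) for v in box) for box in boxes]
    return " ".join(parts)


def parse_line(line):
    """Inverse of format_line; assumes the image path contains no spaces."""
    image_path, *fields = line.split()
    boxes = [tuple(int(v) for v in field.split(",")) for field in fields]
    return image_path, boxes


# Hypothetical frame with two labeled objects of class 0.
line = format_line("frames/frame_0001.jpg", [(100, 50, 140, 140, 0), (300, 60, 335, 140, 0)])
path, boxes = parse_line(line)
```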

4. Results and discussion

The main result of this work is the development of a prototype application for calculating pedestrian traffic. Fig. 7
shows an example of application operation. Each pedestrian is highlighted in a rectangular frame.

Fig. 7. Example of application operation.

To determine the effectiveness of the trained neural network, a test sample was created containing 20% of the total
sample, which is equal to 150 images. Three tests were performed with different values of the IoU and Min_score
parameters: in the first test, both parameters were 0.4; in the second, IoU = 0.3 and Min_score = 0.5; and in the
third, IoU = 0.4 and Min_score = 0.15. The accuracy of the neural network was 54% in the first test, 67% in the
second, and 79% in the third. Accuracy refers to the ratio of correctly recognized pedestrians to the total number
of pedestrians, expressed as a percentage.
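One way to compute such an accuracy figure is sketched below. How the IoU and Min_score parameters enter the authors' pipeline is not specified, so this example makes two assumptions: Min_score discards low-confidence detections, and IoU is the minimum overlap for a detection to count as a correct recognition of a labeled object. All names and sample boxes are hypothetical; the defaults copy the values of the third test.

```python
def iou(box_a, box_b):
    """Standard intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy(ground_truth, detections, iou_threshold=0.4, min_score=0.15):
    """Percentage of labeled boxes matched by a sufficiently confident detection."""
    # Keep confident detections and process the most confident ones first.
    candidates = sorted((d for d in detections if d[1] >= min_score),
                        key=lambda d: d[1], reverse=True)
    unmatched = list(ground_truth)
    hits = 0
    for box, _score in candidates:
        best = max(unmatched, key=lambda gt: iou(box, gt), default=None)
        if best is not None and iou(box, best) >= iou_threshold:
            unmatched.remove(best)  # each labeled box can be matched only once
            hits += 1
    return 100.0 * hits / len(ground_truth) if ground_truth else 0.0


# Two labeled objects; detections are (box, confidence score) pairs.
labeled = [(0, 0, 10, 10), (20, 20, 30, 30)]
found = [((1, 1, 10, 10), 0.9), ((50, 50, 60, 60), 0.8), ((20, 20, 30, 30), 0.1)]
strict = accuracy(labeled, found)
lenient = accuracy(labeled, found, min_score=0.05)
```

In this toy case, lowering the score threshold admits the third detection and raises the measured accuracy, which mirrors the sensitivity to Min_score reported in the tests.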

5. Conclusions

In this work, the problem of calculating pedestrian traffic was solved. As an example, we chose one busy
intersection in Chelyabinsk, where two stops and a pedestrian crossing are clearly visible. Following the purpose of
the study, the designed topology of an artificial neural network was implemented, the neural network was trained on
a training sample, and an application was developed. The developed neural network was tested. As part of the term
project, an application was developed for recognizing pedestrians with the use of a convolutional neural network
with the YOLOv3 architecture.
In the future, we plan to deploy the solution at all intersections of Chelyabinsk equipped with video
cameras. To do this, we will enlarge the data set and modify the topology for operation in real time.


The authors express their gratitude to Intersvyaz company for providing access to the video stream for scientific


The work was supported by Act 211 of the Government of the Russian Federation, contract 02.A03.21.0011.


