LTE Troubleshooting Cases in Huawei Equi
By Jose Heredia
https://joseh.me
INTRODUCTION
RAN troubleshooting for LTE sites is a continuous task that must be handled with high priority,
since problems that are not resolved in time have a large impact on users and lead to higher
churn. As RF engineers, we need to monitor KPIs daily along with customer complaints so that we
can proactively optimize the network. As a general rule, we should monitor network-level KPIs but
also analyze the Top N cells/sites, since the impact of these sites on overall network performance
may go unnoticed while local users are experiencing bad performance. Some of the most
common KPIs to monitor are: RRC accessibility, eRAB accessibility, service drop and DL
throughput.
In the next pages we will be reviewing some typical RF troubleshooting cases that we can face
when optimizing LTE networks.
In any mobile network, cells must have nearby cells defined as neighbors in order to allow
handovers (moving from one cell to another without affecting the ongoing service: voice call,
video streaming, internet browsing, etc.). If neighbors are not correctly set up, handover
failures occur, which can lead to service drops.
In Huawei equipment, it is possible to export the neighbor configuration of all sites and
compare its parameters with the real values configured in each site. For example, we can get
the current PCI, TAC, CellID, eNB ID, etc. of each cell and then review how the
neighbors are defined to check whether those parameters are correctly configured. Drive
tests are also a useful way to find this kind of issue: if there are many service drops during
the mobility tests, it is very likely that the neighbor configuration is not correct.
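The audit described above can be sketched in a few lines of Python. This is only an illustration, assuming the exported data has already been parsed into dictionaries. The field names (`enb_id`, `cell_id`, `pci`, `tac`, `source`) and all the values are hypothetical and do not correspond to any Huawei export format.

```python
from typing import Dict, List, Tuple

# A cell is identified here by (eNB ID, Cell ID). This key is an assumption of the sketch.
CellKey = Tuple[int, int]


def audit_neighbors(actual: Dict[CellKey, dict], neighbors: List[dict]) -> List[dict]:
    """Return one finding per neighbor definition that disagrees with the real cell."""
    findings = []
    for nbr in neighbors:
        key = (nbr["enb_id"], nbr["cell_id"])
        real = actual.get(key)
        if real is None:
            # The neighbor points to a cell that does not exist in the real configuration.
            findings.append({"source": nbr["source"], "target": key,
                             "issue": "target cell not found"})
            continue
        for field in ("pci", "tac"):
            if nbr[field] != real[field]:
                findings.append({"source": nbr["source"], "target": key,
                                 "issue": f"{field} mismatch",
                                 "defined": nbr[field], "actual": real[field]})
    return findings


# Hypothetical data: the real configuration of two cells and two neighbor definitions.
actual_cells = {
    (1001, 1): {"pci": 120, "tac": 500},
    (1002, 2): {"pci": 45, "tac": 500},
}
neighbor_defs = [
    {"source": (1000, 1), "enb_id": 1001, "cell_id": 1, "pci": 120, "tac": 500},
    {"source": (1000, 1), "enb_id": 1002, "cell_id": 2, "pci": 45, "tac": 501},
]

findings = audit_neighbors(actual_cells, neighbor_defs)
for finding in findings:
    print(finding)
```

Each finding names the source cell, the target cell and the mismatching parameter, which is the list of differences to be fixed in the network.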
After the neighbor configuration audit is performed and the differences are fixed in the
network, there is a clear improvement in LTE retainability (Fig 1). Handover execution attempts
and successes also increase considerably after the change (Fig 2). This is expected:
this kind of failure impacts the preparation phase (before execution), so once
the preparation failures are reduced, there is a higher number of execution attempts
and successes.
For this case we have 2 troubleshooting problems that were handled differently, but both impact
the reselection from 3G to 4G.
When there are 3G to 4G reselection issues, the first thing we check is whether there is LTE
coverage in the area. In field tests with the phone forced to select 4G, the team reported good
RF levels for LTE (RSRP above -90 dBm, SINR above 7 dB).
After confirming that LTE is available in the problematic area, the next step is to check
whether the 3G serving cell has the correct configuration for 4G reselection. This is
For this case, it was found that no object had been created in this MO for the 3G serving cell,
which is the main reason why the 3G to 4G reselection was not happening. After
checking the whole site, it was found that the MO was not defined for any cell in the
site.
As can be seen in Fig 3, it is very important to define the LTE carriers available in the
area and to give them a priority higher than the 3G carrier priority.
Another important parameter to configure correctly is the measurement bandwidth, which
should match the real bandwidth of the LTE carrier.
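The three conditions above (LTE carriers defined, LTE priority higher than the 3G priority, measurement bandwidth matching the real carrier bandwidth) can be expressed as a simple check. The following is a minimal sketch. The dictionary keys, the EARFCN, the priorities and the bandwidths are hypothetical illustration values, not actual MO parameter names.

```python
from typing import Dict, List


def check_lte_reselection(serving_3g_priority: int,
                          lte_carriers: List[dict],
                          real_bandwidth_mhz: Dict[int, int]) -> List[str]:
    """Return the list of problems found in the 3G to 4G reselection setup of one 3G cell."""
    if not lte_carriers:
        return ["no LTE carrier defined: 3G to 4G reselection cannot happen"]
    issues = []
    for carrier in lte_carriers:
        earfcn = carrier["earfcn"]
        # The LTE carrier must have a higher priority than the serving 3G carrier.
        if carrier["priority"] <= serving_3g_priority:
            issues.append(f"EARFCN {earfcn}: LTE priority {carrier['priority']} "
                          f"is not higher than 3G priority {serving_3g_priority}")
        # The measurement bandwidth must match the real bandwidth of the LTE carrier.
        real_bw = real_bandwidth_mhz.get(earfcn)
        if real_bw is not None and carrier["meas_bandwidth_mhz"] != real_bw:
            issues.append(f"EARFCN {earfcn}: measurement bandwidth "
                          f"{carrier['meas_bandwidth_mhz']} MHz differs from real {real_bw} MHz")
    return issues


# Hypothetical values for one 3G cell.
real_bw = {1850: 10}
good = [{"earfcn": 1850, "priority": 6, "meas_bandwidth_mhz": 10}]
bad = [{"earfcn": 1850, "priority": 2, "meas_bandwidth_mhz": 5}]

print(check_lte_reselection(3, [], real_bw))
print(check_lte_reselection(3, good, real_bw))
print(check_lte_reselection(3, bad, real_bw))
```

The empty-list case corresponds to the situation found in this case, where no object existed for any cell of the site.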
In Fig 4, it can be seen how the UE moves from 4G to 3G and then to 2G during CSFB.
According to the current strategy set by the customer for its network, 3G to 2G handover
should not be enabled if the 3G coverage is continuous. 3G drive tests in the area show a
stable connection and good RSCP levels in the cluster, so this IRAT handover should be
disabled.
After deactivating the switch, the field test team checked that the UE remained in 3G
during the voice call and returned to LTE immediately after the call was finished.
Case 3: RRC Access degradation due to high random access preamble usage
By analyzing the random access preamble usage of the network, it was noticed that a few cells
reach 100% at some hours.
In the previous chart, it is noticeable that the random access preamble usage reaches maximum
values in some peak hours. Fig 9 shows the timing advance samples located within 1 km and
2 km for one of the top cells; the number of samples increases at the same time the
random access preamble usage reaches its maximum.
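As background for the timing advance analysis, the relation between a timing advance value and distance can be sketched as follows. In LTE, one timing advance step is 16 Ts (with Ts = 1/(15000 x 2048) s), which corresponds to roughly 78 m of one-way distance. The sketch assumes raw timing advance indexes are available, whereas vendor counters usually report samples already grouped into distance ranges. The sample values are hypothetical.

```python
from typing import Iterable

SPEED_OF_LIGHT_M_S = 299_792_458
TS_SECONDS = 1 / (15_000 * 2048)   # LTE basic time unit Ts
TA_STEP_SECONDS = 16 * TS_SECONDS  # granularity of one timing advance step


def ta_to_distance_m(ta_index: int) -> float:
    """One-way distance for a timing advance index (the TA covers the round trip)."""
    return ta_index * TA_STEP_SECONDS * SPEED_OF_LIGHT_M_S / 2


def share_in_range(ta_samples: Iterable[int], min_m: float, max_m: float) -> float:
    """Fraction of samples whose estimated distance falls in [min_m, max_m)."""
    distances = [ta_to_distance_m(ta) for ta in ta_samples]
    if not distances:
        return 0.0
    in_range = sum(1 for d in distances if min_m <= d < max_m)
    return in_range / len(distances)


# Hypothetical timing advance indexes collected for one cell.
samples = [5, 13, 20, 40]
print(round(ta_to_distance_m(1), 1))        # about 78 m per step
print(share_in_range(samples, 1000, 2000))  # share of samples between 1 km and 2 km
```

Tracking this share per hour next to the preamble usage shows whether distant users are driving the congestion.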
Based on these results, an increase of 2 degrees in electrical downtilt for this cell will help to
reduce its coverage and share the traffic increase in peak hours, reducing the probability of
random access preamble congestion.
For the remaining top cell, the coverage is well contained so it is recommended to perform a
re-planning of random access resources.
When analyzing the handover preparation success rate, some top cells were found with the KPI
below 90%. Fig 10 shows how a handover preparation failure occurs.
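The selection of top cells by handover preparation success rate can be sketched as below. The counter names (`prep_attempts`, `prep_successes`), the cell names and the values are hypothetical. Only the 90% threshold comes from the text.

```python
from typing import Dict, List, Tuple


def low_prep_success_cells(counters: Dict[str, dict],
                           threshold_pct: float = 90.0) -> List[Tuple[str, float]]:
    """Return (cell, success rate in %) for cells below the threshold, worst first."""
    flagged = []
    for cell, c in counters.items():
        attempts = c["prep_attempts"]
        if attempts == 0:
            continue  # no handover attempts, so the KPI is undefined for this cell
        rate = 100.0 * c["prep_successes"] / attempts
        if rate < threshold_pct:
            flagged.append((cell, round(rate, 2)))
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical daily counters per cell.
counters = {
    "CELL_A": {"prep_attempts": 1000, "prep_successes": 850},
    "CELL_B": {"prep_attempts": 500, "prep_successes": 495},
    "CELL_C": {"prep_attempts": 0, "prep_successes": 0},
    "CELL_D": {"prep_attempts": 200, "prep_successes": 120},
}

top_cells = low_prep_success_cells(counters)
print(top_cells)
```

The cells returned here are the candidates for the S1 trace used to confirm the root cause.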
In order to confirm the root cause, a trace was performed on the S1 interface of the impacted
cell. This trace shows several Handover Required messages followed by preparation failures.
Fig 12 - S1AP_Handover_Preparation_Fail
The reason for the failure is Unknown Target ID, which means that the source cell has a wrong
configuration for the target neighbor cell. We could directly check the configuration and
compare the neighbor definition against the real cell configuration, but for this case we
check the message S1AP_Handover_Require, where the target cell parameters are sent.
Fig 13 - S1AP_Handover_Require
The message shows that the source cell is trying to perform a handover to a cell with the
following parameters:
After checking the configuration of the target cell, it is found that for eNB ID 800452, the real
TAC is 13501. So we can conclude that the preparation failure occurs because the neighbor cell
definition in the source cell is wrong (TAC doesn't match).
The drive test for SSV shows a maximum DL throughput below 4 Mbps, where the target value is 25
Mbps on average. Traffic KPIs confirmed that the cell was under low load at the time of
the test, so we discarded high load as a possible cause of the low throughput.
In order to test only the RF environment, a packet injection test was configured in the eNB
pointing to the test UE. Packet injection orders the eNB to send a large number of
packets to a specific UE to test DL throughput in the RF environment without involving the
transmission (Tx) or core network: the packets are generated in the eNB and sent
through the air interface to the UE.
The tests show that the Max DL Throughput reached 30Mbps confirming that there is no
limitation in the RF environment.
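The elimination logic followed in this case can be summarized in a small decision function. This is a sketch of the reasoning only. The function name and the returned labels are hypothetical, and the numbers used in the example are the ones reported in this case.

```python
def locate_dl_bottleneck(drive_test_mbps: float,
                         injection_mbps: float,
                         target_mbps: float,
                         cell_highly_loaded: bool) -> str:
    """Suggest where to look next for the cause of low DL throughput."""
    if drive_test_mbps >= target_mbps:
        return "no issue"
    if cell_highly_loaded:
        # High load alone can explain the low throughput, so check it first.
        return "cell load"
    if injection_mbps >= target_mbps:
        # Packets generated in the eNB reach the target over the air,
        # so the limitation is behind the eNB.
        return "transport or core"
    return "rf environment"


# Values from this case: below 4 Mbps in the drive test, 30 Mbps with packet
# injection, 25 Mbps target and a cell under low load.
print(locate_dl_bottleneck(4, 30, 25, cell_highly_loaded=False))
```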
Based on these results, it was escalated to the Tx team who found that the problem was due to a
limitation on the DSCP in the Tx. Once the customer fixed the issue, it was noticed how the Tx
throughput increased.
Fig 16 - RxMaxSpeed at Tx
The rank indicator 2 usage KPI went down to 0% after a specific date. In LTE networks, MIMO is a
feature that allows multiple simultaneous transmissions and receptions for capable UEs.
Increasing the number of parallel streams increases average throughput. Rank indicator 2
means there are 2 independent streams between the eNB and the UE, which theoretically doubles
the throughput.
Based on these results, an implementation team was sent to the site to review the port
connections between the antenna and the RRU. When using MIMO, there should be more than 1 Tx
port between antenna and RRU. For example, for 2T2R there are usually 2 ports (each port is
Tx/Rx), and for 2T4R there should be 4 ports (2 Tx/Rx and 2 Rx).
For this case, the site was a 2T4R, so there were 4 ports connected to each antenna.
When physical problems are suspected, it is also important to review alarms on the site. For this
case one alarm is triggered:
After reviewing with the integration team, it was found that the site was configured with a CPRI
port bit rate of 1.25G. For the 2T4R scenario, this is a low-capability CPRI configuration. The
installed optical transceiver is low capacity and cannot support the 2.5G configuration, so it
must be replaced with one capable of reaching 2.5G.
The main reason for the degradation in eRAB accessibility is RNL (Radio Network Layer). In the
figure below, there are 2 specific dates where the failures increased noticeably (March 16 and
March 28). These 2 dates match the dates on which 2 MMEs were upgraded, so it is clear that
both operations impacted the accessibility in the same way.
MME KPI shows that S1 mode failure times of UE-initiated service request (#10 implicit
detached) reduced considerably and S1 mode failure times of UE-initiated service request (#111
protocol error, unspecified) increased.
Fig 24 - S1 mode failure times of UE-initiated service request (#10 implicit detached)
Fig 25 - S1 mode failure times of UE-initiated service request (#111 protocol error, unspecified)
The analysis shows that after the upgrade the parameter BIT5 of BYTE_EX45 automatically
changed from 0 to 1. The next figure shows the effects that can occur if this switch
is modified.
So, the solution for this issue was to change this parameter back to 0. Once the change was
applied, the performance recovered.
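As an illustration of what a bit-level switch such as BIT5 of BYTE_EX45 means, the sketch below reads and clears one bit of a byte value. It assumes that bits are numbered from 0 starting at the least significant bit, which may not match the vendor's numbering, and the byte values are hypothetical. On the real equipment, the change is applied through the vendor's own configuration commands, not with code like this.

```python
def get_bit(byte_value: int, bit: int) -> int:
    """Return the value (0 or 1) of one bit, assuming bit 0 is the least significant."""
    return (byte_value >> bit) & 1


def clear_bit(byte_value: int, bit: int) -> int:
    """Return the byte with the given bit set to 0 and all other bits unchanged."""
    return byte_value & ~(1 << bit) & 0xFF


# Hypothetical byte value after the upgrade: bit 5 and bit 2 are set.
after_upgrade = 0b00100100
print(get_bit(after_upgrade, 5))          # state of the switch
restored = clear_bit(after_upgrade, 5)
print(format(restored, "08b"))            # only bit 5 was changed
```

Comparing the byte value before and after an upgrade in this way helps to spot switches that changed without an explicit operation.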