Keywords: Well log interpretation; Machine learning; Deep neural networks; Cement evaluation; Well integrity; Well logging

The integrity of cement in cased boreholes is typically evaluated using well logging. However, well logging results are complex and can be ambiguous, and decisions associated with significant risks may be taken based on their interpretation. Cement evaluation logs must therefore be interpreted by trained professionals. To aid these interpreters, we propose a system for automatically interpreting cement evaluation logs, which they can use as a basis for their own interpretation. This system is based on deep convolutional neural networks, which we train in a supervised manner using a dataset of around 60 km of interpreted well log data. Thus, the networks learn the connections between data and interpretations during training. More specifically, the task of the networks is to classify the bond quality (among 6 ordinal classes) and the hydraulic isolation (2 classes) in each 1 m depth segment of each well based on the surrounding 13 m of well log data. We quantify the networks’ performance by comparing over all segments how well the networks’ interpretations of unseen data match the reference interpretations. For bond quality, the networks’ interpretation exactly matches the reference 51.6% of the time and is off by no more than one class 88.5% of the time. For hydraulic isolation, the interpretations match the reference 86.7% of the time. For comparison, a random-guess baseline gives matches of 16.7%, 44.4%, and 50%, respectively. We also compare with how well human reinterpretations of the log data match the reference interpretations, finding that the networks match the reference somewhat better. This may be linked to the networks learning and sharing the biases of the team behind the reference interpretations. An analysis of the results indicates that the subjectivity inherent in the interpretation process (and thereby in the reference interpretations we used for training and testing) is the main reason why we were not able to achieve an even better match between the networks and the reference.
1. Introduction

Cementing is a very common operation carried out during the construction phase of the majority of oil wells. The idea of cementing operations can be traced back to 1859 and 1871, with the first cement operation executed in 1883 by Hardison & Stewart Oil Company (Mau and Edmundson, 2015; Hill, 1871). Cementing operations have two main objectives. The first objective is to provide well integrity by controlling flow in the well through hydraulic isolation between different zones in the wellbore. Thus, successful cementing prevents fluids from geological formations flowing into other geological zones or to the surface. The second objective is to provide support for the casing.

To ensure that a cementing job was successful, we must test the cement. Older cementing jobs may also be tested again to ensure that they still hold, e.g. as a step in cost-effective plug & abandonment operations (Vrålstad et al., 2019). To date, the only method that can confirm zonal isolation with certainty is a pressure test. However, pressure tests may be economically unfeasible, and field experience shows that they in some cases risk causing damage to the cement. Therefore, companies typically evaluate cement through well logging, where measurement tools are lowered into the casing string to check the presence and quality of the cement on its outside.

Since Grosmangin et al. (1961) and Anderson and Walker (1961) published the first effective cement evaluation method, which was based on sound waves, further tool development has been somewhat slow. Even today’s techniques cannot unambiguously verify the presence of
isolating cement. Among the existing methods, however, acoustic logging techniques are the most common and efficient according to Allouche et al. (2006). These comprise sonic and ultrasonic techniques (Grosmangin et al., 1961; Hayman et al., 1991; van Kuijk et al., 2005). The data recorded by acoustic tools may be processed to get estimates of various parameters that describe the state of well components such as the casing and the cement. The well’s potential for hydraulic isolation may then be interpreted from these results.

However, this interpretation is a complex task with associated risks (Benge, 2014; Kyi and Wang, 2015), and it must therefore be carried out by trained professionals. They use their understanding to integrate the various log results and their knowledge about the well to produce an evaluation of the cement status. We describe this process further in Sec. 2.1.

This task is performed under time pressure, as further well development may hinge on the evaluation results. As Belozerov et al. (2018) also point out, the process of well interpretation is complex, time-consuming, and quite subjective as well: Although no studies on this have yet been published, oil companies are well aware that different interpreters can reach different conclusions from the same data. We consider this subjectivity further in Sec. 2.2.

In the work presented here, we created and tested an automatic system to generate well log interpretations from log data. Our goal is to help interpreters offset the aforementioned problems by using the system’s automatic interpretations as a baseline for manual interpretation. As the system is trained on previous manual interpretations, it may thus help keep future interpretations more consistent. The system would also speed up the interpretation process if human interpreters only need to correct its interpretations where it fails. While humans may always need to be part of the loop when important decisions are made based on well logs, such a system could also provide quick-look interpretations for less important wells. We give an overview of our approach to creating such an automatic system in Sec. 2.3.

2. Background

2.1. Manual well interpretation

When interpreting a cement evaluation log, the well integrity interpreter looks at a plot of various log results against depth. (Fig. 2 shows three examples of such plots.) The task is to explain the results by partitioning the well into zones, or intervals, answering two main questions for each of them: ‘What is the bonding between annular solids and the casing in this zone?’ and ‘What is the zone’s potential for hydraulic isolation?’ The integrity interpreter must have access to the log results, either physically on paper or digitally through specialised software that can read, plot, and process well log data. The interpreter must also have access to the well history, which provides context to the well log (Benge, 2014). Table 1 shows part of an interpretation resulting from such a task.

Table 1
Extract of the official interpretation of the well Volve 15/9-F-9 after a log operation in June 2009 (Equinor, 2018).

Interval top [mMD]   Interval bottom [mMD]   Cement or formation bond quality   Potential for hydraulic isolation
150                  178                     Moderate to poor                   Low
178                  188                     Good to moderate                   Medium
188                  195                     Moderate                           Medium
195                  201                     Good to moderate                   Medium
201                  221                     Poor                               Low
221                  332                     Poor                               Low
332                  341                     Moderate to poor                   Low
341                  440                     Moderate to poor                   Low
440                  451                     Moderate                           Medium

Fundamentally, interpreting cement logs is similar to interpreting open-hole logs. In both cases, the interpreter must understand the physics and processing underlying different measurements, and the limits thereof, so that they can weigh different measurements against each other (Kyi and Wang, 2015). The interpreter must therefore be trained in the art of borehole logging and imaging in order to reduce interpretation error. Furthermore, the interpreter must be familiar with the decisions that hinge on the interpretation, and the potential risks therein.

Unlike open-hole log interpretation, where nomenclature, rules, and documentation are generally agreed upon, integrity interpreters lack a general recipe for interpretation. Another challenge of cased-hole interpretation relates to the difficulty of correctly interpreting vertical features. While interpreting azimuthal heterogeneities in cased holes tends to be easier, vertical heterogeneities such as fluid channels raise questions such as whether the channels are connected, how long the channels are, and whether the integrity of that zone is compromised. The interpreter must also consider possible reasons why the vertical features exist in the first place. A problem of comparable complexity from open-hole logging is the interpretation of drilling-induced fractures in images from spiral holes.

2.1.1. Interpretation of acoustic logs

Over the past decades, well integrity interpreters have mainly been exposed to and familiarised with results from acoustic logging methods. These results are extracted from raw acoustic waveforms through time or frequency domain-based processing (Pardue et al., 1963; Hayman et al., 1991; Allouche et al., 2006). While laboratory tests can evaluate the integrity of cement sheaths in detail (Albawi et al., 2014), the limitations of current acoustic technologies mean that a much smaller amount of information is extracted during logging. This information can be ambiguous, and interpreters must therefore often make strong assumptions, for example that intervals with solids bonded to the outside of the casing can be interpreted as isolated.

Acoustic logging is based on acoustic waves propagating in the well. There are two subtypes: Sonic logging is lower-frequency (10–80 kHz) and nondirective, while ultrasonic logging is higher-frequency (0.1–2 MHz) and directive. Both require tool immersion in a liquid-filled environment inside the casing that is free of debris or thin oil films. They also require centralisation inside the casing. If any of these requirements is not completely fulfilled, the interpreter must take the resulting uncertainties into account while interpreting.

Sonic tools’ monopole transmitters impinge omnidirectional pressure pulses onto the casing. There, they excite extensional waves (specifically, in the low-dispersive S0 Lamb mode) travelling up and down the casing, as well as wavefields in the annulus and formation (Pardue et al., 1963; Tubman et al., 1984, 1986; Sinha and Zeroug, 1999; Wang et al., 2016). These wavefields in turn feed back into the wavefield in the fluid inside the casing. This is recorded by two hydrophones at distances of 3 ft (0.914 m) and 5 ft (1.524 m) from the transmitter. The main feature extracted from this wavefield is the cement bond log (CBL) data channel. CBL is the signal amplitude in mV of the component of the wavefield that arrives first at the closest hydrophone. This component has propagated along the casing, as this is the fastest path from transmitter to receiver. As the casing wave continuously loses energy into the materials on both sides of the casing, and the loss into bonded solids is strong, lower CBL values signify solids bonded to the outside of the casing. The CBL value can roughly be interpreted by thresholding. If the CBL is below around 10 mV, this indicates full azimuthal coverage of solids. If the CBL is around a certain high value (depending on casing size and normalisation, though typically 50–60 mV), this indicates free pipe, i.e. no solids bonded to the casing. As CBL varies with depth, it is typically plotted as a curve. The full recorded waveforms for every depth can also be plotted as an image, with depth along one dimension and waveform samples along the other. This is called a variable density log (VDL), and this data channel can be used for further qualitative
Fig. 2. Log plot of data channels from three sections from Volve 15/9-F-11 B. The plotted quantities are explained in Sec. 3.2, with the exception of IRMN and IRMX
(minimum and maximum casing inner radius), ERAV (average casing outer radius), and THMN, THAV, THMX, and THNO (minimum, average, maximum, and
nominal casing thickness). The G/L/S distr. column estimates the share of gas (red), liquid (blue), microdebonded solids (green) and solids (brown), mainly by
thresholding impedance as described by Allouche et al. (2006). (For interpretation of the references to color in this figure legend, the reader is referred to the Web
version of this article.)
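The rough CBL thresholding rule described in Sec. 2.1.1 (below around 10 mV for full azimuthal coverage, around 50–60 mV for free pipe) can be sketched as follows. This is a minimal illustration only: the function name, the default thresholds, and the wording of the intermediate reading are assumptions, and the free-pipe value in particular depends on casing size and normalisation.

```python
import math


def rough_cbl_reading(cbl_mv, bonded_below_mv=10.0, free_pipe_above_mv=50.0):
    """Map a CBL amplitude in mV to a rough qualitative reading.

    Both thresholds are illustrative defaults (assumptions), not calibrated
    values; the free-pipe level varies with casing size and normalisation.
    """
    if math.isnan(cbl_mv):
        return "no data"
    if cbl_mv < 0:
        raise ValueError("CBL amplitude cannot be negative")
    if cbl_mv < bonded_below_mv:
        # Strong energy loss into bonded solids gives a low amplitude.
        return "full azimuthal coverage of solids"
    if cbl_mv >= free_pipe_above_mv:
        # Little energy loss: nothing bonded to the outside of the casing.
        return "free pipe"
    # The text gives no rule for this range; the label is an assumption.
    return "intermediate"


if __name__ == "__main__":
    for value in (4.0, 25.0, 55.0):
        print(value, "mV ->", rough_cbl_reading(value))
```

Such a rule only reads the CBL curve in isolation; in practice the interpreter weighs it against the VDL and ultrasonic channels.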
with a standard upper limit of 7.5 MRayl, we found this range directly from the data.) We observe a large vertical channel of low impedance at the top of the casing (image sides), which indicates fluid pockets interconnected for the length of the section. The impedance image also shows traces of possible formation footprint as higher-impedance horizontal features at 2801 m and 2815 m, though it cannot be concluded that the formation provides bonding support. The CBL value is moderate throughout the section, while the VDL shows formation arrivals throughout the section.

Even taking eccentering-related uncertainties into account, the sonic and ultrasonic measurements lead to an interpretation of moderate bond quality and low potential for hydraulic isolation due to the presence of the top channel.

Section 3112–3132 m: From the theoretical ToC, we can expect cement to exist in this section. The QC situation is similar to the previous sections, except that the eccentering is now largely within the recommended maximum.

Again, the impedance is heterogeneous, varying from 2 to 9 MRayl. The impedance is generally high and CBL generally low, which indicates good solid coverage, with the exception of a liquid pocket at 3126–3131 m. This pocket is visible on the ultrasonic, CBL, and VDL logs. The impedance also contains a heterogeneous vertical feature which may indicate two phases of solid deposition at different times. However, there is little to no chance of a liquid channel along the ultrasonic image outside the liquid patch.

The ultrasonic and sonic measurements lead to an interpretation of high bond quality and high hydraulic isolation in this section, except for the localised liquid pocket at 3126–3131 m, which should be interpreted further in the context of the entire log.

2.2. Subjectivity in interpretation

In many fields, different experts can look at the same data and come to different conclusions. For example, two medical doctors may come up with two different medical diagnoses based on the same signs and symptoms, or two psychiatrists may make different diagnoses after having listened to the same clinical interview. Two components of this variability are often considered (Popović and Thomas, 2017):

Interobserver variability, the tendency of different observers to make different judgements when interpreting the same data.
Intraobserver variability, the tendency of a single observer to make different judgements when interpreting the same data multiple times.

In well log interpretation, this variability can manifest itself in several ways. Interpreters may disagree on which label to apply to a well interval, for example if one interpreter sees inhomogeneous solids around the entire cross-section, whereas another sees a fluid channel through cement. They may also disagree on where to place the boundaries between interpreted intervals, and how fine to make their interpretations. For example, when interpreting a long stretch of patchy cement, one interpreter may use a single long interval, whereas another may use many intervals, one for each patch and one for each stretch between patches.

This variability can be partly offset by having a team of interpreters collaborating on interpretation tasks, although this further increases the time an interpretation requires. For the dataset used in this article, the collaboration approach was to have a first interpreter perform a complete interpretation of each log, and then run each interpretation through a quality control process where one or more highly experienced individuals in a team of interpreters examine and correct it. (However, while the members of a team may develop a common understanding of how to interpret integrity, this understanding may not be universal, and the team’s approach may differ from that of another team.)

2.3. Automatic interpretation through supervised learning

From the preceding sections, we see that making an accurate well integrity interpretation is a very difficult problem. Another difficult problem would be to write a computer program that duplicates the decision-making process of a human interpreter. Firstly, it would require a complete understanding of that process for every case that may come up when interpreting an integrity log. Secondly, it would require translating this process into code.

Another path towards an automatic interpretation system is machine learning, in particular supervised learning, where an algorithm is trained on already-interpreted data. A major advantage of supervised learning is that we do not need to implement the decision-making process ourselves. Instead, we show interpreted data to the supervised learning algorithm, and it learns by itself the connections between certain types of log data and their interpretations. If done correctly, the algorithm is then able to make reasonable interpretations of data that it did not see during training.

Research on machine learning on well log data is currently taking off, with many papers on the topic published at the SPWLA 60th Annual Logging Symposium in 2019 (Bennis and Torres-Verdín, 2019; Bigoni et al., 2019; Dai et al., 2019; Gupta et al., 2019; Jain et al., 2019; Li et al., 2019; Liang et al., 2019; Oruganti et al., 2019; Peyret et al., 2019; Shao et al., 2019; Wu et al., 2019). Only one paper on the topic had previously been published at the symposium (Akkurt et al., 2018).

Interesting papers on the topic have also appeared elsewhere. For example, Onalo et al. (2018) used neural networks to recreate open-hole sonic logs from other open-hole log data for cases where reliable sonic logs were not available, Belozerov et al. (2018) used neural networks to identify oil reservoirs from well log data, and Gkortsas et al. (2019) used support vector machines and neural networks to automatically identify an ultrasonic waveform feature that can give additional information on the P-wave speed of the annular material in cased boreholes.

However, using machine learning to replicate manual interpretation of cement quality is quite a difficult problem. In particular, it is difficult because machine learning typically requires a lot of data, and sufficient amounts of interpreted log data are hard to come by outside of oil companies. Additionally, the log data can be quite heterogeneous: Different log runs can use different tools whose measurements cannot be compared directly, and even the same tool can have different values of settings such as resolution across different runs.

The task of interpreting the well status based on well log channels is very similar to the general task of image classification. In both tasks, periodically sampled data is analysed in order to classify it according to what it contains. Classification of photos has been very extensively studied over the last decade, with current best approaches based on convolutional neural networks (CNNs) (Russakovsky et al., 2015; Chollet, 2018). For this reason, the work presented here is also based on CNNs. CNNs are also widely used for image classification tasks in other fields, such as medicine (Anthimopoulos et al., 2016; Cheng and Malhi, 2017; Østvik et al., 2019). In our case, however, the task is somewhat more difficult, as we do not classify the well status from single images. Rather, we must use a heterogeneous collection of image and curve data channels for our classification.

3. Data and methods

3.1. Datasets

The dataset underlying this work is the combination of two individual datasets. The first is the public Volve Data Village dataset from Equinor (2018). It contains, among a great deal of other data, interpreted cement evaluation log data from three wells in the Volve field. These were recorded from 2009 to 2016. The second dataset contains interpreted cement evaluation log data from 29 wells in another field. The data was recorded from 2009 to 2012.

In total, we have official interpretations from 54 logging operations, all from the same team of interpreters. The interpretation process is described in Sec. 2.1, and an example extract of a manual interpretation
is shown in Table 1. The shortest and longest interpreted intervals in our dataset are 1 m and 1783 m long respectively, and the median interval is 33.5 m long. Where log data was available, we associated each interpreted interval with one or more log data files containing the data that was interpreted. Repeat passes were included wherever visual comparison of the data showed that their calibration and alignment matched the main passes. Thus, our 54 interpretations were associated with 99 data files that together contain 62594 m of interpreted log data.

3.2. Logging tools and data channels

In our dataset, all log data files contain data from a sonic and/or ultrasonic tool, as described in Sec. 2.1.1. The sonic tool used is mainly Schlumberger’s Digitizing Sonic Log Tool (DSLT), and the ultrasonic tool is mainly Schlumberger’s Ultrasonic Imager Tool (USIT), which uses the pulse-echo technique described in Sec. 2.1.1.

Each tool provides many data channels containing measured or processed data. In a log file, each data channel is sampled at a constant depth resolution. Curve channels hold only a single value per depth, while image channels hold multiple values (representing data over, e.g., a set of azimuthal angles or the samples of time signals) per depth. Example data for the most important channels are shown in Fig. 2, and the channels used as input to our neural networks are:

CBLF: Sonic curve channel; corrected cement bond log.
VDL: Sonic image channel; received sonic waveforms.
ECCE/AZEC: USIT curve channels; eccentering magnitude and [...]
AWBK: USIT image channel; amplitude of returned pulse.
IRAV: USIT curve channel; average casing inner radius.
IRBK: USIT image channel; deviation of casing inner radius from [...]
T2BK: USIT image channel; deviation of casing thickness from its nominal value.
AIBK: USIT image channel; acoustic impedance of material behind the casing.
UFLG: USIT image channel; QC flags indicating where USIT processing fails.
UCAZ/RB: USIT curve channels; rotation of USIT images.
GR: Gamma ray curve channel; local radioactivity from formation.

For logs measured using Schlumberger’s Sonic Scanner instead of the DSLT, we used its discriminated synthetic CBL (DCBL) curve channel as a drop-in replacement for CBLF.

A problem with a dataset such as ours is that it is very heterogeneous. The depth resolution of each channel can vary from log run to log run. Some channels are also missing in some log files, typically (but not always) because the tool producing them was not present on the tool string. One logging operation also had to be excluded as it only used a slim sonic tool whose CBLF values were not properly normalised.

[...] ordered, i.e., they form an ordinal scale. For HI, original ‘High’ and ‘Yes’ labels were translated to ‘Yes’, and others were translated to ‘No or uncertain’. (We did not distribute HI more finely as the resulting classes would have become very unbalanced due to the scarcity of intermediate labels.) We excluded log reports that did not provide interpretations of both BQ and HI.

3.3.2. Well log data

The well log data files were provided in the archaic yet widely-used DLIS file format (API, 1991). As DLIS files would be too slow to continuously read from during training, we mirrored the files’ contents to an HDF5-based format using the Java library Log I/O, developed by Petroware. During training, we can read directly and efficiently from these HDF5 files.

A data channel may be provided at different depth resolutions in different log data files. However, our neural networks need their input data at a consistent resolution. For that reason, we defined a target resolution for each data channel. The data of each channel then has to be transformed from its original resolution to the target resolution in as fast a manner as possible, so that this transformation will not slow down the neural network training. Where the original resolution is higher and an integer multiple of the target resolution, we stride through the data channels. Elsewhere, we use nearest-neighbour interpolation in depth. Unlike other basic interpolation methods, nearest-neighbour interpolation naturally handles NaN values, which data channels frequently use to represent missing data.

3.4. Data organisation

We split all of the interpreted intervals into depth segments of 1 m length. We then tied each segment to the 13 m interval of log data surrounding it, as shown in Fig. 3. This ensures that the automatic interpretation of each segment is supported by log data also outside the segment. This choice makes it possible, for example, to differentiate between fluid patches and fluid channels when interpreting HI. We chose the interval length as 13 m to ensure that each log data interval contains at least one casing collar, with casing joints being around 12 m long.

In total, we have 58781 samples in the dataset, where each sample $i$ represents the interpretation labels $y_i^{\mathrm{BQ}}, y_i^{\mathrm{HI}}$ of a 1 m segment and data $X_i$ from the corresponding 13 m log data interval. We partitioned these samples into six folds. To ensure that no two folds contain information from the same well section, we placed all logs with the same well and casing size in the same fold. We optimised this partitioning using simulated annealing to ensure that the folds were of approximately equal size, with approximately the same distribution of classes for both BQ and HI. Table 2 shows the distribution of each fold.

3.5. Accuracy and metrics
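The four accuracy metrics of this section (Eqs. (1a) to (1d)) and the balanced confusion matrix of Eq. (1e) can be sketched in NumPy as below. This is an illustrative sketch rather than the code used for the paper: the function names are hypothetical, and class labels are assumed to be encoded as integers 0, ..., K-1 in their ordinal order.

```python
import numpy as np


def class_weights(y):
    """w_i = 1 / (number of samples sharing the reference class of sample i)."""
    y = np.asarray(y)
    counts = np.bincount(y)
    return 1.0 / counts[y]


def accuracy(y, y_hat, tolerance=0, balanced=False):
    """Precise (tolerance=0) or adjacent (tolerance=1) accuracy.

    With balanced=True, each sample is weighted by class_weights(y), so that
    every reference class contributes equally to the result.
    """
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    hit = np.abs(y_hat - y) <= tolerance
    w = class_weights(y) if balanced else np.ones(len(y))
    return float(np.sum(hit * w) / np.sum(w))


def balanced_confusion_matrix(y, y_hat, n_classes):
    """Rows index reference classes, columns predictions; rows sum to 1."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    cm = np.zeros((n_classes, n_classes))
    np.add.at(cm, (y, y_hat), 1.0)
    row_sums = cm.sum(axis=1, keepdims=True)
    # Rows of classes absent from the reference are left as zeros.
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)


if __name__ == "__main__":
    # Enumerate every (reference, guess) pair for 6 ordinal classes, which
    # corresponds to a uniform random guess on balanced reference labels.
    classes = np.arange(6)
    y, y_hat = (a.ravel() for a in np.meshgrid(classes, classes, indexing="ij"))
    print("precise :", accuracy(y, y_hat))               # 6/36
    print("adjacent:", accuracy(y, y_hat, tolerance=1))  # 16/36
```

Enumerating all label pairs reproduces the random-guess figures quoted in this section: 6/36 = 16.7% for precise and 16/36 = 44.4% for adjacent accuracy.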
Table 2
Distribution of the samples in each of the six folds among classes of bond quality and hydraulic isolation.

                         Bond quality (BQ)                                                              Hydraulic isolation (HI)
Fold    No. of samples   Good     Moderate to good   Moderate   Poor to moderate   Poor     Free pipe   Yes      No or uncertain
Total   58 617           13 216   2836               5449       5646               13 116   18 354      16 931   41 686
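The grouped fold partitioning of Sec. 3.4, whose result Table 2 summarises, can be sketched as a small simulated-annealing search. The source does not specify the cost function, move set, or cooling schedule, so everything below is an assumption made for illustration: groups are keyed by hypothetical (well, casing size) tuples, the cost is the summed squared deviation of per-fold class counts from their mean, and the temperature decays geometrically.

```python
import math
import random


def fold_cost(assignment, group_counts, n_folds):
    """Sum over classes of squared deviations of per-fold counts from the mean."""
    n_classes = len(next(iter(group_counts.values())))
    totals = [[0] * n_classes for _ in range(n_folds)]
    for group, fold in assignment.items():
        for c, n in enumerate(group_counts[group]):
            totals[fold][c] += n
    cost = 0.0
    for c in range(n_classes):
        mean = sum(t[c] for t in totals) / n_folds
        cost += sum((t[c] - mean) ** 2 for t in totals)
    return cost


def anneal_folds(group_counts, n_folds=6, n_steps=20000,
                 t_start=1.0, t_end=1e-3, seed=0):
    """Assign whole groups to folds so that class counts are roughly balanced.

    group_counts maps a group key to its list of per-class sample counts.
    Because groups are moved as a unit, no group is ever split across folds.
    """
    rng = random.Random(seed)
    groups = sorted(group_counts)
    current = {g: rng.randrange(n_folds) for g in groups}
    current_cost = fold_cost(current, group_counts, n_folds)
    best, best_cost = dict(current), current_cost
    scale = max(current_cost, 1.0)  # puts cost changes on the temperature scale
    for step in range(n_steps):
        temperature = t_start * (t_end / t_start) ** (step / n_steps)
        group = rng.choice(groups)
        old_fold = current[group]
        current[group] = rng.randrange(n_folds)  # propose moving one group
        new_cost = fold_cost(current, group_counts, n_folds)
        delta = (new_cost - current_cost) / scale
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_cost = new_cost  # accept the move
            if new_cost < best_cost:
                best, best_cost = dict(current), new_cost
        else:
            current[group] = old_fold  # reject the move
    return best, best_cost
```

In this sketch the per-class counts of BQ and HI labels would simply be concatenated into one count vector per group, so that a single cost balances fold size and both class distributions at once.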
Fig. 3. Each 1 m interpreted well segment is tied to a 13 m log data interval surrounding it. Segments that are close together, such as the two shown here, will use partly overlapping intervals of log data.

\[ \mathrm{UPA}(y, \hat{y}) = \frac{1}{N} \sum_{i=0}^{N-1} \mathbf{1}(\hat{y}_i = y_i). \tag{1a} \]

Here, $\mathbf{1}(x)$ is an indicator function which equals 1 if its argument $x$ is true and 0 otherwise. For BQ classification, we also use the adjacent accuracy

\[ \mathrm{UAA}(y, \hat{y}) = \frac{1}{N} \sum_{i=0}^{N-1} \mathbf{1}\left(\hat{y}_i^{\mathrm{BQ}} \approx y_i^{\mathrm{BQ}}\right), \tag{1b} \]

where $\hat{y}_i^{\mathrm{BQ}} \approx y_i^{\mathrm{BQ}}$ holds true if the predicted label $\hat{y}_i^{\mathrm{BQ}}$ is off by no more than one class from the reference label $y_i^{\mathrm{BQ}}$. The adjacent accuracy is therefore the proportion of samples for which this holds. The idea of adjacent accuracy is that, for example, predictions of ‘Moderate to good’ or ‘Poor to moderate’ may still be close enough to a reference label of ‘Moderate’ to be useful. (For HI, using adjacent accuracy does not make sense as there are only two classes; the adjacent accuracy would have been 100% in every case.)

We also report these accuracies in balanced form, using the same definitions as the scikit-learn library by Pedregosa et al. (2011). Balanced accuracy compensates for the imbalance of classes in the dataset, seen in Table 2, by weighting each sample as $w_i = 1 / \sum_j \mathbf{1}(y_j = y_i)$, i.e. inversely proportional to its class’ prevalence in the dataset:

\[ \mathrm{BPA}(y, \hat{y}) = \frac{1}{\sum_{i=0}^{N-1} w_i} \sum_{i=0}^{N-1} \mathbf{1}(\hat{y}_i = y_i)\, w_i, \tag{1c} \]

\[ \mathrm{BAA}(y, \hat{y}) = \frac{1}{\sum_{i=0}^{N-1} w_i} \sum_{i=0}^{N-1} \mathbf{1}\left(\hat{y}_i^{\mathrm{BQ}} \approx y_i^{\mathrm{BQ}}\right) w_i. \tag{1d} \]

These metrics ensure that the results of every class have the same weight. The unbalanced accuracy metrics in (1a) and (1b) do not, and therefore emphasise the more common classes, namely ‘Good’, ‘Poor’, and ‘Free pipe’ for BQ and ‘No or uncertain’ for HI.

In summary, we use four accuracy metrics: Unbalanced precise accuracy (UPA), balanced precise accuracy (BPA), unbalanced adjacent accuracy (UAA), and balanced adjacent accuracy (BAA).

To show the distribution of labels in more detail, we also use balanced confusion matrices, which show the joint distribution of the reference labels $y_i$ and the predicted labels $\hat{y}_i$:

\[ \mathrm{CM}_{k,l}(y, \hat{y}) = \frac{1}{\sum_{i=0}^{N-1} \mathbf{1}(y_i = C_k)} \sum_{i=0}^{N-1} \mathbf{1}(y_i = C_k,\ \hat{y}_i = C_l). \tag{1e} \]

The vertical axis indexes the classes $C_k$ for the reference, while the horizontal axis indexes the classes $C_l$ for the prediction. With this balancing, all matrix rows sum to 100%.

Now, what kind of accuracy can we expect? As a lowest baseline for our classification problems, we have a classifier that simply guesses randomly. For BQ, this would give a precise accuracy of 16.7% and an adjacent accuracy of 44.4%. For HI, the corresponding precise accuracy is 50%. On the high side, we cannot expect accuracies close to 100%; due to the subjectivity discussed in Sec. 2.2, this would be unattainable even by human interpreters. Instead, there must be an upper accuracy threshold, related to the subjectivity inherent in the manual labels on which we train and test.

3.6. Manual reinterpretation

To shed some light on this subjectivity, we arranged a manual reinterpretation of every main log pass in fold 2 of our dataset, with the goal of comparing this reinterpretation with the official interpretation in our dataset. We gave this task to a well integrity researcher with a decade of experience in well logging engineering and research, who is not part of the team behind the official interpretations. He carried out his interpretations directly on the log data files without first having seen the official interpretation. To display and process the log data files in order to make his interpretations, he used the WellCAD software by Advanced Logic Technology. (Similar software like Techlog by Schlumberger and Geolog by Emerson E&P Software could also have been used for the same task.)

3.7. Baseline method setup

The random baseline described in Sec. 3.5 is not a very interesting lower baseline, as any working classifier can beat it. A more interesting baseline to compare the neural network results to would be a classifier using a very simple approach. For this baseline, we classified the BQ and HI parameters in each 1 m well segment $i$ by a simple thresholding of the CBLF channel’s median value $\widetilde{\mathrm{CBLF}}_i$ inside the segment. We chose CBLF
as the input as it is a simple channel that provides good overall information on the well status.

For an interpretation parameter with $K$ classes $C_0, \ldots, C_{K-1}$, we define an ordered sequence of thresholds $T_0, \ldots, T_K$, where $T_0 = 0$ mV, $T_K \to \infty$ mV, and $T_k \le T_{k+1}$. If $T_k \le \widetilde{\mathrm{CBLF}}_i < T_{k+1}$, we assign class label $C_k$ to segment $i$. To find these thresholds with a simple and virtually parameter-free method, we employed decision trees, specifically the implementation in the scikit-learn library by Pedregosa et al. (2011). We balanced the classes and specified the maximum number of leaf nodes as $K$. ($K = 6$ for BQ, and $K = 2$ for HI.)

We found the baseline results using 6-fold cross-validation as shown in Fig. 4(a). We held out one fold at a time for testing, and used the remaining folds to fit a decision tree. For each test fold, we compared the baseline interpretations with the official interpretations as described in Sec. 3.5. After finding results for each test fold in this way, we used the metrics’ mean value as our overall result.

3.8. Neural network setup

All of our data channels are regularly sampled along a depth axis. Some data channels are 2D, being regularly sampled in time (the VDL channel) or azimuthal angle (the AWBK, IRBK, T2BK, AIBK, and UFLG channels) as well. For this type of data, convolutional neural networks have been found to be very effective, and our network setup is therefore based on these. Our networks are implemented in Keras (Chollet et al., 2015) with the TensorFlow backend (Abadi et al., 2015).

As we are using 13 different data channels of different types, it is natural to use multiple inputs in the neural network setup. However,

[...]

network, the 10 different data channels are stacked like color channels in an image. The 1D branch receives the CBLF and GR channels at a resolution of 6 in (15.24 cm), and the VDL branch receives the VDL channel at a resolution of 4 in (10.16 cm) and 5 μs. Because the number of time samples in VDL channels varies across files, we trimmed all channels to 240 time samples (1.2 ms), the lowest common value. VDL channels with fewer samples (provided by uncommon tools) were discarded.

All data channels are normalised individually before input, to a mean value of 0 and a standard deviation of 1, to avoid channels with higher values being weighted more. Missing data channels and data channel values are replaced with zero-values. To augment the data, we exploit the periodic azimuthal symmetry of the USIT images by rolling and flipping the images in angle. When doing so, we also change the values of angular curve channels (AZEC, UCAZ, and RB) accordingly.

The network setup follows recommendations by Chollet (2018). Each branch contains convolutional layers and maxpooling layers, and we tuned their size and number based on our accuracy metrics. The convolution kernel sizes are 3 × 3 for the USIT branch, 7 for the 1D branch, and 5 × 5 for the VDL branch. The poolings of the maxpooling layers are 2 × 2 in the USIT branch, 2 in the 1D branch, and 2 × 4 in the VDL branch. As recommended by Chollet (2017, 2018), we used depthwise separable convolution layers. These use a representationally efficient convolution approach that separates spatial and channel convolution kernels, and gave us better results than conventional convolutional layers. The three branches are merged after global average pooling, whereupon densely connected layers are used for classification. The convolutional and dense layers use the ReLU activation function. To combat overfitting, the dense layers use a dropout of 0.5, whereas the
using one convolutional branch for each channel would be very convolutional layers use a spatial dropout of 0.2. We used the RMSprop
computationally expensive. For that reason, we divide the channels optimiser with a learning rate of 0.001 and a training batch size of 16
across three branches, shown in Fig. 5, according to their dimension samples. To balance the classes and because training often quickly
ality, source tool, and depth resolution commonality. reaches its highest accuracy (see Sec. 4.2), we defined an epoch as
The USIT branch receives the USIT image channels at a resolution of consisting of 3000 samples drawn equally from every class.
3 in (7.62 cm) and 5� . We upsample channels originally sampled at 10� Conventionally, neural network classifiers for problems with K
by nearest-neighbour interpolation, for the same reasons as described in classes use softmax activation and categorical crossentropy loss with an
Sec. 3.3.2. The branch also receives the USIT curves, which we array output vector of length K. Element k’ of the target vector Y is Yk’6¼k ¼ 0,
broadcast to the same shape as the image data. When provided to the Yk’¼k ¼ 1, where Ck is the manually labelled class for the interval
Fig. 4. Usage of folds for (a) the baseline case, which uses 6-fold cross-validation, and (b) the neural network case, which uses a form of ensembled cross-validation.
For each test fold in the latter case, 5 networks were trained using different validation folds. Together, these networks form an ensemble that was tested on the
test fold.
Fig. 5. Setup of the neural networks, from the three input branches to an output whose size is given by the number K of classes.
(Chollet, 2018). However, this approach implies that the classes are nominal and would ignore the ordinal nature of our BQ classes. Instead, we used an approach like that of Cheng et al. (2008), with a $(K-1)$-length target vector where $Y_{k' < k} = 1$ and $Y_{k' \ge k} = 0$, using sigmoidal activation and binary crossentropy loss. From the first $k'$ in the output vector for which $Y_{k'} < 0.5$, we determine the predicted class as $C_{k'}$. This choice of target vector ensures that the loss function is lower the closer the class prediction is to the manually labelled class. We observed that this choice increased the BQ accuracy of our network.
For training and evaluation, we used a form of ensembled cross-validation, shown in Fig. 4(b). As we did for the baseline, we hold out one fold at a time for testing. Here, however, we also hold out one fold at a time for validation in order to select the best-performing network state during training, i.e., the epoch with the highest balanced precise accuracy on the validation fold. Thus, for each test fold, we use the five different validation folds to train five BQ and five HI networks. The five networks in each group are ensembled to combine their predictions. We used the median of the class predictions of the ensemble's component
Fig. 6. Bond quality confusion matrices for the baseline method, showing the six test folds in reading order. Each matrix is normalised so that each row sums up to
100%. The numbers above each matrix represent its accuracy metrics UPA/BPA/UAA/BAA.
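The ordinal target encoding described in Sec. 3.8 can be illustrated with a minimal sketch. The function names `encode_ordinal` and `decode_ordinal` are hypothetical and not taken from the study's code. The encoding follows the stated rule (ones at positions k' < k, zeros elsewhere), and decoding picks the first output below 0.5. Returning the last class when no output is below 0.5 is an assumption that follows from the encoding but is not spelled out in the text.

```python
def encode_ordinal(k, num_classes):
    """Target vector of length K - 1 with ones at positions k' < k."""
    return [1.0 if k_prime < k else 0.0 for k_prime in range(num_classes - 1)]


def decode_ordinal(outputs, threshold=0.5):
    """Index of the first output below the threshold.

    If no output falls below the threshold, the last class K - 1 is
    returned (an assumption that follows from the encoding above).
    """
    for k_prime, value in enumerate(outputs):
        if value < threshold:
            return k_prime
    return len(outputs)


K = 6  # number of BQ classes
for k in range(K):
    # Encoding a class and decoding it again recovers the class index.
    assert decode_ordinal(encode_ordinal(k, K)) == k

print(encode_ordinal(2, K))                       # [1.0, 1.0, 0.0, 0.0, 0.0]
print(decode_ordinal([0.9, 0.7, 0.4, 0.6, 0.1]))  # 2
```

With this encoding, a prediction that is off by one class disagrees with the target in a single element, whereas a prediction that is off by several classes disagrees in several, which is why the binary crossentropy loss grows with the ordinal distance.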
4. Results
We will first discuss the results of the baseline method, the neural
network method, and the manual reinterpretation individually, before
we compare them qualitatively for a specific well log in Sec. 4.4.
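As a reminder of what the baseline method computes, the thresholding rule can be sketched as follows. This is an illustration only: the function name `classify_segment` and the threshold value are hypothetical, whereas the thresholds in the study were fitted with scikit-learn decision trees. The sketch assigns class index k to a segment when T_k ≤ median CBLF < T_{k+1}, with T_0 = 0 and T_K → ∞ left implicit.

```python
from bisect import bisect_right
from statistics import median


def classify_segment(cblf_samples, inner_thresholds):
    """Return the class index k such that T_k <= median(CBLF) < T_{k+1}.

    inner_thresholds holds T_1 .. T_{K-1} in ascending order (mV);
    T_0 = 0 and T_K -> infinity are implicit.
    """
    cblf_median = median(cblf_samples)
    # bisect_right counts the thresholds that are <= the median, which is
    # exactly the index k of the half-open interval [T_k, T_{k+1}).
    return bisect_right(inner_thresholds, cblf_median)


# Hypothetical threshold for a K = 2 problem such as HI (not from the paper).
hi_thresholds = [10.0]
segment_a = [2.1, 3.0, 2.6, 40.0]  # a single spike barely moves the median
segment_b = [35.0, 42.0, 38.5]

print(classify_segment(segment_a, hi_thresholds))  # 0
print(classify_segment(segment_b, hi_thresholds))  # 1
```

Using the median rather than the mean makes the rule less sensitive to isolated spikes within a segment, as `segment_a` shows.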
Fig. 9. Bond quality confusion matrices for the neural networks, plotted as in Fig. 6. Each confusion matrix is an average of the results of five repetitions.
Fig. 10. Hydraulic isolation confusion matrices for the neural networks, plotted as in Fig. 7. Each confusion matrix is an average of the results of five repetitions.
Fig. 11. Overall confusion matrices for the neural networks, found by averaging the results from all test folds, for bond quality (left) and hydraulic isolation (right).
Table 4
Comparison of the overall BQ and HI results of three methods: Expected values from random guessing (RND), the baseline method (BL), and neural networks (NN). We only report standard deviation for the neural networks, as it is the only method involving stochasticity.
Accuracy metric   BQ                           HI
                  RND    BL     NN             RND    BL     NN
UPA [%]           16.7   44.0   51.6 ± 0.8     50     81.2   86.7 ± 0.3
BPA [%]           16.7   43.5   46.7 ± 0.7     50     82.9   86.6 ± 0.3
UAA [%]           44.4   80.5   88.5 ± 0.2     –      –      –
BAA [%]           44.4   81.3   89.1 ± 0.4     –      –      –
calculate a network root mean square deviation as
$$\sigma_m^{\mathrm{net}} = \sqrt{\frac{1}{6}\sum_{t} \frac{1}{5}\sum_{v \neq t} \frac{1}{5}\sum_{r} \left[a_m(v,t,r) - \bar{a}_m(v,t)\right]^2}. \qquad \text{(2)}$$
The results for all accuracy metrics are shown in Fig. 12.
Similarly, we can look at the variation in accuracy between ensembles tested with the same folds. If $a_m(t,r)$ is the $m$th accuracy metric for repetition $r$ of training an ensemble with test fold $t$ and $\bar{a}_m(t) = \frac{1}{5}\sum_{r} a_m(t,r)$ is the mean over all repetitions, then we can calculate an ensemble root mean square deviation as
Fig. 14. Log plot of data from Volve 15/9-F-11 B, plotted as in Fig. 2, but with two extra columns comparing four interpretations of BQ and HI: The official
interpretation (OF), the neural network interpretation (NN), the reinterpretation (RE), and the baseline interpretation (BL).
the interpretations related to the CBLF spikes caused by casing collars.
In general, both the neural networks and the baseline method give a good match with the official interpretation in the upper 3/5 of the log, except that both predict hydraulic isolation in a section with extensive channeling, which is a simple mistake to catch for a human correcting the automatic log. In the lower 2/5, however, both methods struggle to capture the official interpretation, probably due to the unusual combination of high CBLF values and good impedance readings.
5. Discussion
The accuracy metrics show that both the baseline and neural networks perform quite well, with the neural networks giving an improvement of 3–8 percentage points over the baseline on every accuracy metric. It is somewhat surprising that the baseline still performs as well as it does while only being based on the median of the CBLF channel within each segment. However, as we discussed in Sec. 2.1.1, the CBLF channel, which is all the baseline method has access to, contains much of the same information as the acoustic impedance channel, which the neural networks also have access to. The results thus indicate that a simple thresholding of CBLF can be sufficient to make a decent interpretation, even though this can be improved significantly by also using information from other channels. It may also indicate that CBLF forms the backbone of the official interpretations in this dataset, which is consistent with what we know about the process behind these interpretations.
The manual reinterpretation gives an interesting perspective on interpretation bias. While there is no objective ground truth available to give us an objective reference for interpreters' biases, we can at least compare different interpreters' relative biases with each other. The confusion matrices in Fig. 13 indicate that the official interpreter team has a positive bias compared to our reinterpreter, or equivalently, that the reinterpreter has a negative bias compared to the official interpreter team. The neural networks, however, have been trained on the official interpretations and therefore tend to share the same bias. This may be the main reason that the neural networks outperform the reinterpreter when using the official interpretation as a reference. This also underscores that it is important for the training dataset to be thoroughly quality-controlled interpreted log data, deliberately chosen to represent a reference for automatic interpretation.
Despite the promising results from the neural network, it is important to analyse which factors hold back its performance. In the following sections, we will discuss some factors that may limit the performance of our automatic interpretation system, and discuss other possible
5.1.1. Data size
Perhaps the dataset is simply not large enough compared to the variation in the data? For example, some well or measurement conditions may be rare enough in our dataset that the networks are unable to discern meaningful relationships between data and labels from the available interpreted data.
The question is thus whether accuracy would increase with more data. While we have already used all the data we have available, we can turn the question on its head and investigate whether accuracy would decrease with less data. To do this, we reduced the size of the dataset by removing entire log operations from the folds while retaining the class balance as well as possible. The networks were then trained and validated on the reduced folds and tested on the full folds.
The overall BQ results shown in Fig. 15 for different reductions show that even using as little as 38% of the dataset does not have a very strong impact on the accuracy. Thus, while having more data may reduce the variations between folds, the current performance does not seem to be primarily limited by the size of the dataset.
5.1.2. Subjective labelling in the dataset
In Sec. 2.2 we considered subjectivity in interpretation tasks, and more specifically inter- and intraobserver variability. The manual reinterpretation described in Secs. 3.6 and 4.3 and discussed at the beginning of the current section shows that the interobserver variability can be very strong in well log interpretation. We consider it likely that the official interpretations are also affected by interobserver (and possibly intraobserver) variability. In other words, our dataset may contain very similar input data samples that have been interpreted differently. In a case of two instances of similar data with different interpretations, a network trained on the first interpretation would have a difficult time predicting the second interpretation and vice versa. Additionally, a network trained on both would get mixed messages during training.
To investigate whether this effect limited our performance, we looked more closely at who performed the interpretations in our dataset. While the interpretations were to some degree produced as a team effort, the reports containing them also specify who the first interpreter was. He or she performed the initial interpretation, before it underwent quality control by other interpreters.
To investigate the effect of interobserver variability, under the assumption that this is not completely eliminated by the quality control process, we tried to reduce this variability by selecting a subset of our dataset with the most commonly used first interpreter. (This subset
represents 45.1% of the total dataset.) We divided this subset into six folds and trained and tested neural network ensembles for BQ on the folds in the same way as before.
Fig. 15 shows significantly better results when all interpretations share the same first interpreter. We see particular improvement in the balanced accuracies, which weight the rarer intermediate categories 'Moderate to good', 'Moderate', and 'Poor to moderate' more than the unbalanced accuracies. We would expect these intermediate categories to be more subjective than the more clear-cut categories 'Good', 'Poor', and 'Free pipe'. Thus, our results indicate that the performance is limited by some degree of interobserver variability. To quantify interobserver and intraobserver variability further, however, a dedicated study would be necessary.
It may also be possible to reduce this subjectivity by using a different annotation system for the well log interpretations. For example, the system used for annotating bond quality uses a rating scale from 'Good' to 'Poor' (as well as 'Free pipe', which may at times be difficult to separate from 'Poor'), which is inherently opinion-based. An annotation system aiming for a more objective description of the distribution of material behind the casing may be able to result in more consistent annotations, although there may still be disagreements as to what that distribution is.
5.2. Other possible causes of reduced performance
Beyond the factors that may cause data heterogeneity, other factors may also have limited our performance.
5.2.1. Network setup
It could also be argued that the network setup is not ideal. However, we experimented with a number of variations, including reducing and increasing the capacity of the network. These changes most often did not affect the accuracy significantly, although some changes gave slightly negative effects. For example, while we could expect that choosing a lower learning rate would help reduce the accuracy variation shown in Fig. 12, our tests indicated that it did not, but instead reduced the overall accuracy slightly. This indicates that performance may mainly be limited by other factors than the network setup.
5.2.2. Differences in interval size
Manual interpretation often defines quite large depth intervals with the same interpretation. It can be argued that some of these intervals are coarse-grained and could be divided into multiple subintervals with different labels for BQ and HI. The two manual interpretations in Fig. 14 show this effect to some degree. The automatic interpretation, on the other hand, is in principle free to label each 1 m segment differently. In practice, both the baseline and the neural networks end up with intervals (by which we mean a series of consecutive segments with the same interpretation) that tend to be smaller than the manual interpretations. This discrepancy can reduce the match between the manual and automatic results.
To investigate the interval sizes, we looked at the length of interpreted intervals in the official manual interpretations used for training and testing, and the resulting neural network interpretations. Looking at BQ, the median official and neural network interval lengths are 29 m and 8 m, respectively. For HI, they are 26.5 m and 17 m. This shows that the neural networks do tend to perform a finer-grained interpretation of the well than human interpreters.
This limits our accuracy metrics, as they are calculated through segment-by-segment comparisons between the finer-grained automatic interpretations and the coarser-grained manual interpretations. To get an idea of how much these differences in interval size reduce the interpretation accuracy metrics, we tried forcing the neural network interpretations to use the same depth intervals as the reference interpretations. For all segments inside each official depth interval, we found the median of the automatic labels and set all labels to this median. We performed this procedure separately for BQ and HI.
Table 5 compares the original results shown in Table 4 with results calculated from these interval-restricted automatic interpretations. We find that forcing our automatic interpretations into the coarser depth intervals used by the manual interpreters improves every accuracy metric by around 2–3 percentage points.
This result shows that the fine-grained nature of the automatic interpretations limits the reported accuracy when tested against coarse-grained manual interpretations. However, it also has implications for the training process, which is based on the same manual interpretations. If the manual interpretations are often too coarse-grained, this would make it more difficult for the networks to learn relationships between data and labels. As a hypothetical example, consider a coarse-grained manual interval with BQ labelled as 'Moderate' that also contains a smaller subinterval that would have been labelled 'Poor' if the manual interpreter had been requested to perform a finer-grained resolution. During training, the neural networks would be taught that segments similar to those inside this subinterval should be labelled 'Moderate' instead of 'Poor'. This would give the networks mixed messages when similar segments in other parts of the dataset are labelled 'Poor', likely reducing the networks' performance.
5.2.3. External information
The human interpreters may have access to information beyond what the data channels provide. For example, the well history can tell them where to expect the top of cement, as we saw in Sec. 2.1.2. The well log history may also tell them if some data channels should not be trusted uncritically, for example due to logs being run with an improper tool setup. The automatic system, on the other hand, only has access to the information present in the data channels. According to Benge (2014), such information can be quite important to the interpretation process.
However, there is a large variety of possibly useful external information, and it is largely not present in machine-readable form in our dataset. It is therefore difficult for us to quantify the importance of having such information available. In any case, however, a human interpreter is needed to verify an automatic log interpretation. Part of that task would be to compare it with any such information that might be available. Thus, the networks not having such information available would not be a major problem in this workflow.
5.3. Estimates of confidence
When using an automatic interpretation system like the one described in this article in practice, it would be useful if the system gave an estimate of its confidence in its interpretations. For example, an interval marked with low confidence could warrant closer scrutiny than an interval marked with high confidence.
However, while it is straightforward to get confidence estimates from conventional neural networks performing nominal classification, these confidence estimates are often not useful unless found using special techniques (Guo et al., 2017; DeVries and Taylor, 2018). Additionally, for the ordinal classification technique that we use to improve performance as described in Sec. 3.8, Beckham and Pal (2017) explain that estimating confidence is less straightforward. Additionally, the
Table 5
Comparison of the original neural network accuracy metrics and the metrics found when using the median neural network interpretation within each manual depth interval.
Acc. metric   BQ                          HI
              Original     Median         Original     Median
UPA [%]       51.6 ± 0.8   54.0 ± 1.6     86.7 ± 0.3   89.5 ± 0.5
BPA [%]       46.7 ± 0.7   50.0 ± 2.8     86.6 ± 0.3   88.6 ± 0.7
UAA [%]       88.5 ± 0.2   90.8 ± 0.4     –            –
BAA [%]       89.1 ± 0.4   91.3 ± 0.6     –            –
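The interval-restriction step behind Table 5 (Sec. 5.2.2) can be sketched as follows. The function name `restrict_to_intervals` is hypothetical, and labels are assumed to be integer class indices per 1 m segment. Intervals are taken as runs of consecutive segments sharing one reference label, as defined in Sec. 5.2.2. How ties in the median were resolved in the study is not stated; the sketch uses `statistics.median_low` so that the result is always an existing class index, which is an assumption.

```python
from itertools import groupby
from statistics import median_low


def restrict_to_intervals(reference_labels, automatic_labels):
    """Replace automatic labels by their median within each reference interval.

    An interval is a run of consecutive segments sharing one reference label.
    median_low keeps the result an existing class index (tie-breaking choice
    assumed here, not taken from the study).
    """
    restricted = []
    start = 0
    for _, run in groupby(reference_labels):
        length = len(list(run))
        interval = automatic_labels[start:start + length]
        restricted.extend([median_low(interval)] * length)
        start += length
    return restricted


reference = [0, 0, 0, 0, 3, 3, 3]  # class index per 1 m segment
automatic = [0, 1, 0, 0, 3, 2, 3]
print(restrict_to_intervals(reference, automatic))  # [0, 0, 0, 0, 3, 3, 3]
```

In this toy example the two isolated deviations in the automatic labels are smoothed away, which is the mechanism by which the interval-restricted metrics in Table 5 come out higher than the original ones.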
E.M. Viggen et al. Journal of Petroleum Science and Engineering 195 (2020) 107539