PARSIMONIOUS MACHINE LEARNING MODELS IN REQUIREMENTS ELICITATION TECHNIQUES


Introduction. Business analysis, as an extension of requirements engineering, is crucial to software development. The main business analysis deliverables are requirements and designs used as a basis for solution implementation, testing, and deployment. In turn, the critical input for the tasks of analysis, specification, and modeling of requirements and design for software is the information collected during elicitation. Standard approaches to the requirements-gathering process have been systematized and described in the form of dozens of standard elicitation techniques. Industrial guidelines and empirical studies contain detailed descriptions of the techniques' elements and usage considerations but do not provide a process for selecting an elicitation technique [1].
Consequently, one of the challenges for business analysts and requirements engineers, especially novice ones, is selecting the requirements elicitation techniques that best fit their project. As a result, some techniques are misused, others are never used, and only a few are applied consistently. To address this problem, a machine learning model was proposed [1] to predict/recommend the use of the elicitation techniques Interviews, Document Analysis, Process Analysis, and Interface Analysis depending on the project's context.
In the study [2], the model's prediction accuracy was increased by transforming the dataset from imbalanced to balanced, thus making the Random Forest Classifier learner unbiased toward the majority class. Feature importance scores were identified by the mutual information criterion, i.e., independently of the machine learning classifier. This served as assurance that a feature's score does not depend on the learning algorithm's bias. The ten features with the most significant importance scores were reported in tables 4-5 as predictors for choosing the elicitation technique.
However, in both [1] and [2], the best model was still selected from the candidates based on performance metrics such as Accuracy and AUC.
A model selected that way is also called a "best-fit" model. The best-fit model is complex: it includes many parameters in order to better approximate the training data. The more variables included in a model, the more dependent the model becomes on the observed data, so it can fail on the test data because noisy, uninformative, and unrepresentative data are included in the model; i.e., a best-fit model is prone to overfitting [3].
Although the best-fit models included twenty features, we took the ten features with the most significant importance scores, which may be incorrect if the optimal model should include fewer than ten features.
To eliminate these problems with the model proposed in works [1, 2], in the current study we develop a parsimonious model that still accurately predicts/recommends the use of the techniques Interviews, Document Analysis, Process Analysis, and Interface Analysis.
Analysis of recent achievements and publications. The principle of parsimony suggests a model should be as simple as possible with respect to the included variables, the model structure, and the number of parameters. It is a desired characteristic of a model, defined by a suitable trade-off between the squared bias and the variance of parameter estimators [4]. A parsimonious model can be constructed in the following ways. A stepwise selection is based on a statistical algorithm that checks for the "importance" of variables and either includes or excludes them based on a fixed decision rule. The "importance" of a variable is defined in terms of a measure of the statistical significance of the coefficient(s) for the variable. The statistic used for linear regression is an F-test; for logistic regression, the likelihood ratio, score, and Wald tests.
In a "best subsets" approach, a number of models containing one, two, three variables, and so on are fitted to determine the "best" model according to specified criteria.
To meet the current research's goals, only the "best subsets" approach from those listed above can be applied. The statistical measure commonly used to compare models with different numbers of parameters based on the parsimony principle is the Akaike Information Criterion (AIC). It measures the distance between a candidate model and the true model: the closer the distance, the more similar the candidate is to the truth. AIC calculates the distance between models as the expected relative Kullback-Leibler (K-L) divergence. Although AIC is a consistent estimator of K-L divergence, there is no statistical test to compare values of AIC [8, 9].
Another criterion for comparing candidate models is the Bayesian Information Criterion (BIC), derived from Bayesian statistical analysis and estimation. BIC approximates a Bayes factor and has desirable properties for hypothesis testing and model selection [10-13]. BIC is calculated for each candidate model as

BIC = -2 ln(L) + k ln(n), (4)

where L is the maximized likelihood of the model, k is the number of its parameters, and n is the sample size. For a model whose target takes values in the set {0, 1} and which is built with learner algorithms such as logistic regression, a support vector machine (SVC), or decision tree classifiers (RandomForestClassifier), the maximized log-likelihood from (4) is calculated via the logistic loss function:

ln(L) = Σ_i [y_i ln(p_i) + (1 - y_i) ln(1 - p_i)],

where p_i is the probability with which the fitted model predicts the positive class. The model with the smallest BIC is preferable: complex models are almost always likely to fit the data better, so the first term in definition (4) will have a low value; however, the second term provides a way to penalize these extra parameters and therefore increases BIC.

To assess the goodness of fit of the selected candidate models compared to the etalon (or best-fit) models, work [14] proposes testing a hypothesis based on the difference between sample means of the models' performance metric. If the mean accuracy of the selected parsimonious models is A1 and the mean accuracy of the best-fit models is A2, then the parsimonious models fit if the null hypothesis A1 = A2 is not rejected by the computed two-tailed p-value of the t-statistic (eq. 6).
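Given the definitions above, BIC can be computed directly from a classifier's predicted positive-class probabilities. A minimal stdlib sketch (the function name and sample data are illustrative, not from the paper):

```python
import math

def bic(y_true, p_pred, k):
    """BIC = -2*ln(L) + k*ln(n), where ln(L) is the log-likelihood
    of a {0, 1} target under the predicted positive-class probabilities."""
    n = len(y_true)
    eps = 1e-15  # guard against log(0) for hard 0/1 predictions
    log_l = sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(y_true, p_pred)
    )
    return -2.0 * log_l + k * math.log(n)

# With equal fit quality, the model with more parameters gets the larger BIC.
y, p = [1, 0, 1, 1], [0.9, 0.1, 0.8, 0.7]
assert bic(y, p, 2) < bic(y, p, 5)
```

This is exactly the parsimony penalty described above: at equal likelihood, the candidate with fewer parameters always yields the smaller BIC.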
Вісник Національного технічного університету «ХПІ». Серія: Системний аналіз, управління та інформаційні технології, № 1 (9)'2023.

Here n is the number of the parsimonious models included in the test, and ddof is the delta degrees of freedom, with a value equal to 1. Other classification metrics, such as AUC, F1 score, precision, recall, and the Jaccard score, can be used to measure the goodness of the parsimonious model in the same manner as specified in equation (6) for the accuracy metric.
The problem statement. We aim to build parsimonious models for the four datasets considered in works [1, 2] to avoid the overfitting problems associated with best-fit models; to design an algorithm for assessing a parsimonious model's performance compared to the best-fit model and selecting the best candidate; and to execute tests proving that the proposed algorithms can be used with other datasets.
Experiment Methodology. Our experiment methodology for constructing and assessing the parsimonious model is specified per each phase of the supervised learning model's creation lifecycle [15].
Data preparation and acquisition. The original data was formed from a survey conducted among business analysts and requirements engineers in Ukraine regarding their use of requirements elicitation techniques and their context. Three hundred twenty-eight practitioners completed the survey. Four respondents were disqualified due to incorrect data: a non-filled industrial sector and non-filled team types. The features included in the dataset used in this study are of two types:
 features that describe the project's context;
 features that list all elicitation techniques used in the project.
The following features belong to the first type: country; project size (small: up to 15 …). The dataset contains information about the features, along with the names of target classes such as "Elicitation", "Document Analysis", "Interface Analysis", and "Process Analysis". However, a feature with the same name as a target class is not included in the list of features. The datasets' characteristics and imbalance ratios, calculated as majority-to-minority samples, are specified in table 1.
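For illustration, the majority-to-minority imbalance ratio reported in table 1 can be computed in a few lines of stdlib Python (function name and data are illustrative):

```python
from collections import Counter

def imbalance_ratio(y):
    """Majority-to-minority sample ratio of a target vector."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())

# e.g., three positive samples vs. one negative gives a ratio of 3.0
```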
Data preprocessing. The imbalanced predictors matrix X and target vector y were transformed into a balanced X*, y* by applying the SMOTE method, which generates synthetic minority-class samples rather than duplicating existing ones.

The candidate-model construction steps are described per the pseudocode (fig. 1). In lines 2-5, candidate models are fitted with the number of included features increased by one at a time: the first candidate model includes one feature, and the last candidate includes F features, where F is the maximum number of features in our datasets. In line 2, features are sorted according to their mutual information (MI) score in descending order, and i features are selected from the start of the sorted list. In lines 3-4, the predictors' matrix is truncated to include only the selected features, and the train and test subsets are formed from it and the target variable. In lines 5-6, a model is fitted on the training subset, and the performance metrics accuracy (Acc) and area under the ROC curve (AUC) are calculated on the test subset. In lines 7-8, if the model object's calculated performance satisfies the minimum required levels of accuracy (Acc_min) and AUC (AUC_min), then the model object is saved in the result vector S.
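The loop of fig. 1 can be sketched in stdlib-only Python. The mutual-information estimator below works for discrete features, and fit_and_score is a hypothetical hook (not from the paper) that trains a model on the selected features and returns it with its Accuracy and AUC on the test subset:

```python
import math
from collections import Counter

def mutual_info(xs, ys):
    """MI between a discrete feature and the target, in nats."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def candidate_models(X_cols, y, fit_and_score, acc_min=0.7, auc_min=0.7):
    """X_cols: {feature_name: column values}. Grows the feature set one
    MI-ranked feature at a time and keeps models above both thresholds."""
    ranked = sorted(X_cols, key=lambda f: mutual_info(X_cols[f], y), reverse=True)
    S = []
    for i in range(1, len(ranked) + 1):        # lines 2-5: widen the feature set
        model, acc, auc = fit_and_score(ranked[:i])
        if acc >= acc_min and auc >= auc_min:  # lines 7-8: keep good-enough models
            S.append((model, i, acc, auc))
    return S
```

A real run would plug a scikit-learn classifier into fit_and_score; the sketch only fixes the control flow of the algorithm.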
A general guideline used in supervised machine learning gives the following intervals for the Accuracy and AUC metrics:
 if Accuracy/AUC = 0.5, then the classifier is guessing, equal to flipping a coin;
 if 0.5 < Accuracy/AUC < 0.7, then this is poor classification;
 if 0.7 < Accuracy/AUC < 0.8, then this is acceptable classification;
 if 0.8 < Accuracy/AUC < 0.9, then this is excellent classification;
 if Accuracy/AUC >= 0.9, then this is outstanding discrimination.
The above rules are used to set the minimum values of Accuracy and AUC for the algorithm (fig. 1). If, as a result of executing the algorithm, the vector S is empty, we propose lowering the minimum values of the performance metrics. If the vector S is not empty, we can move on to grading the candidate models by Bayes factor; the steps undertaken are described in the pseudocode (fig. 2). For each model object from S, in line 2 its BIC weight, denoted w_m, is identified. Then, in lines 3-6, each model is graded according to the Bayes factor's rules: "positive" models are saved to vector M1, "strong" models to vector M2, and "very strong" models to vector M3. In the current work, we ignored "weak" candidates.
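The grading step of fig. 2 can be sketched as below. The weight formula is the standard exp(-Δ/2) normalization of BIC differences, and the grade cutoffs follow the common Kass-Raftery convention; the paper's exact thresholds are not reproduced here, so treat both as assumptions:

```python
import math

def bic_weights(bics):
    """w_m = exp(-d_m/2) / sum_j exp(-d_j/2), with d_m = BIC_m - min(BIC)."""
    d = [b - min(bics) for b in bics]
    e = [math.exp(-x / 2) for x in d]
    s = sum(e)
    return [x / s for x in e]

def grade(delta_bic):
    """Kass-Raftery style evidence grades for a BIC difference."""
    if delta_bic < 2:
        return "weak"
    if delta_bic < 6:
        return "positive"
    if delta_bic < 10:
        return "strong"
    return "very strong"
```

The weights sum to one, so they can be read as approximate posterior probabilities over the candidate set; the model with the smallest BIC always receives the largest weight.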
In lines 1-4, the mean values, the t-statistic, and the two-tailed p-value of the normal distribution for the "very strong" models are computed. In lines 5-7, if the null hypothesis is not rejected, the parsimonious model is added to the result vector R. Lines 8-9 are executed if the goodness-of-fit test fails for the models from M3; in that case, steps 2-7 are repeated with the "strong" and then the "positive" models. Lines 11-14 are executed only if all models from M1, M2, and M3 fail the goodness-of-fit test. In that scenario, the model M with the best performance is selected.

Study results and their discussion. Multiple candidate models were created according to the designed algorithm (fig. 1). Applying the Bayes factor grades as specified in fig. 2 allowed us to select: a "very strong" parsimonious model to recommend Interviews as an elicitation technique, which included eight features and was evaluated with performance Accuracy = 90%, AUC = 98% (fig. 4, a), which are 4% and 1% lower than the Accuracy and AUC of the best-fit model (table 2, "Interviews"); and a "very strong" parsimonious model to recommend Document Analysis as an elicitation technique, which included five features and was evaluated with performance Accuracy = 90%, AUC = 95% (fig. 4, b), which are 1% and 2% lower than the Accuracy and AUC of the best-fit model (table 2, "Document Analysis").
A "strong" parsimonious model to recommend Interface Analysis as an elicitation technique included nine features and was evaluated with performance Accuracy = 81%, AUC = 88% (fig. 5, a), which are 3% and 2% lower than the Accuracy and AUC of the best-fit model (table 2, "Interface Analysis"); a "strong" parsimonious model to recommend Process Analysis as an elicitation technique included fifteen features and was evaluated with performance Accuracy = 81%, AUC = 86% (fig. 5, b).

As specified in fig. 3, the hypothesis test is applied to the models' performance metrics from table 2. The null hypothesis H0: the mean difference between the parsimonious and best-fit models' accuracies is 0. The H1 hypothesis: the mean difference between the accuracies is non-zero. The t-statistic per equation 7 gives t = -2.8. The p-value with 3 degrees of freedom is 0.066, which is greater than 0.05, so H0 is accepted; i.e., the parsimonious models are accepted, and the best-fit models can be ignored. Similarly, a hypothesis test is run with the null hypothesis H0: the mean difference between the AUC values of the parsimonious models and the AUC values of the best-fit models is 0, against H1: the mean difference between the AUC values is non-zero. The t-statistic per equation 7 gives t = -7. The p-value of t = -7 with 3 degrees of freedom is 0.006, which is less than 0.05, so H0 is rejected, and the best-fit model is preferable due to the reduced performance of the parsimonious models based on the mean AUC value.
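The goodness-of-fit test of fig. 3 reduces to a paired t-statistic compared against a critical value. A sketch with made-up accuracy values (not the paper's table 2 data; with these placeholder numbers the test happens to reject H0, unlike the accuracy test reported above):

```python
import math

def paired_t(pars, best, ddof=1):
    """Paired t-statistic for the mean difference between two metric samples."""
    d = [a - b for a, b in zip(pars, best)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - ddof)
    return mean / math.sqrt(var / n)

# Illustrative accuracies for four parsimonious vs. four best-fit models.
t = paired_t([0.90, 0.90, 0.81, 0.81], [0.94, 0.91, 0.84, 0.83])
T_CRIT = 3.182  # two-tailed Student's t critical value, alpha = 0.05, df = 3
reject_h0 = abs(t) > T_CRIT
```

Comparing |t| with the tabulated critical value for n - 1 degrees of freedom is equivalent to checking whether the two-tailed p-value falls below 0.05, which is the decision rule used in the paper.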
Thus, it can be concluded that applying the algorithm of fig. 3 with each performance metric in sequence helps to identify when a parsimonious model's performance is degraded and to decide on a suitable model selection. We accepted the built parsimonious models in the current test experiment because the models' accuracy did not deteriorate according to the goodness-of-fit test.
Conclusions and perspectives of further development. In the current study, algorithms to build parsimonious candidate machine learning models and select the best candidate were designed and tested with four datasets collected for requirements elicitation technique selection. The results showed that the best candidate models, graded as "very strong" and "strong", reduced the number of features: three times for Interviews and Interface Analysis, five times for Document Analysis, and 1.7 times for Process Analysis. This helped to avoid the data overfitting problem.
The designed algorithm to assess the goodness of fit of the parsimonious models was applied with two performance metrics, accuracy and AUC, in sequence. Based on the received results, it is concluded that by applying the proposed procedure, gaps in the performance of a parsimonious model compared to the best-fit model can be detected, and a decision on suitable model selection can be made.
In summary, the obtained results allow us to recommend using a parsimonious model instead of the best-fit model to predict the use of a particular elicitation technique in IT projects and to form recommendations based on that model.
Several directions for future research can be considered, such as creating machine learning models for other business analysis techniques, e.g., specification and modeling, prioritization, and structuring of the business analysis architecture.

Fig. 2. Steps to grade the candidate models

Model validation. The assessment of the goodness of fit of the models from M1, M2, and M3 compared to the best-fit models B was done through the steps of the pseudocode (fig. 3).

Fig. 3. Steps to assess the goodness of fit of parsimonious models

The algorithm (fig. 3) leaves experts to finally judge which model to use if all parsimonious candidate models fail the assessment. It could be either the best-fit models from B or the parsimonious model with the best performance metrics, because their minimum values are set as an input parameter of the algorithm (fig. 2).

Fig. 4. Candidate model(s) BIC weight, Accuracy, AUC: a - Interviews; b - Document analysis

Fig. 5. Candidate model(s) BIC weight, Accuracy, AUC: a - Interface analysis; b - Process analysis

The overall procedure thus is to:  create candidate models from the same dataset that include a different number of features;  compare and select the best candidate as a final parsimonious model;  assess the fit of the selected candidate.