TECHNOLOGY OF IDENTIFYING ANTIPATTERNS IN ANDROID PROJECTS WRITTEN IN KOTLIN LANGUAGE

This paper addresses the lack of tools for identifying the characteristics of low-quality code in Android projects written in Kotlin. Modern approaches to identifying antipatterns in program code are reviewed, and the methods used to find code problems in Android projects are analyzed, with a focus on the DECOR and Paprika approaches. Conclusions are drawn about the importance of finding design flaws in program code for mobile software development and its further support. An approach to identifying antipatterns in the Kotlin code of Android projects is proposed, and an algorithm for identifying low-quality Kotlin code is presented. The technology for detecting characteristics of poor-quality code consists of four stages: collecting metrics about the analyzed software system, building a quality model, converting the quality model into a graph representation, and identifying predefined antipatterns. Metric collection, which covers both Android-specific metrics and the object-oriented metrics of Chidamber and Kemerer, is proposed to be implemented by parsing the source code and converting it into an abstract syntax tree using the KASTree library. The KASTree library is integrated through the Adapter design pattern. The quality model is built using the Paprika tool, supplemented by a number of introduced metrics. The quality model is converted into a graph representation so that the complex queries used to identify antipatterns execute quickly and reliably. Antipattern identification through database queries is based on various template rules, including the Catolino rules. Features of the Cypher query language for graph databases are used to express the rules as queries.
Results of the work can be used in development of software for poor quality code identification in mobile applications written in Kotlin language, as well as in studies of mobile development antipatterns for this language.

Mohammed Ilyas Azeem [12] analyzed the machine learning techniques that have been investigated for identifying low-quality code. He concluded that most existing studies use decision trees or support vector machines as the machine learning algorithm. However, the problem of finding the optimal configuration has not been properly solved.
Problems of identifying low-quality code in Android projects. The analysis of [5][6][7][8][9] found a large number of studies devoted only to the Java programming language. Java remains one of the most popular programming languages in the world, and for Android projects in particular. However, the Kotlin programming language is also becoming popular [13]; it is developing rapidly and is supported by many developers all over the world. Kotlin is included in the list of officially supported languages for developing Android applications, and since May 7, 2019 it has been the recommended language for Android application development. Nevertheless, no tools were found that can detect low-quality code written in Kotlin. Methods, technologies, and algorithms for detecting the characteristics of poor-quality code in Kotlin projects are still lacking. The main cause of this problem is probably the relative youth of the language: the stable version was released only 4 years ago. Research into identifying code issues for Kotlin is therefore very important.
Another problem is the presence of non-identifiable characteristics. Since antipatterns were first studied and classified, there have been various attempts to identify the known characteristics. However, the review of [5,6,8,10] confirmed that existing applications can detect only some of the code issues described. Fig. 1 shows how often different antipatterns have been studied in the literature. It follows from fig. 1 that some characteristics are studied more frequently, while there are still poor-quality code factors for which no relevant studies have been conducted. Of all the characteristics of bad-quality code identified by Fowler [2], some are still not detected by any identification technique. The literature describes more than thirteen different techniques covering 21 code defects, and 16 methods have been developed for the God class alone. However, none of them detects the unstudied defects mentioned above. It can be concluded that not all code flaws are currently identifiable.

Fig. 1 Study frequency of individual characteristics
Analysis of existing methods. The literature [5,9] describes two main approaches to identifying the characteristics of low-quality code in Android applications. The first one is used in software called DECOR. Its main idea is to use expert knowledge to build a classification and taxonomy from which the identification algorithms are generated. The other method was developed for the Paprika tool. It is based on the metrics of the application being analyzed and on the construction of quality and graph models. Source code defects are detected on the basis of the graph model. Fig. 2 illustrates a common algorithm for identifying antipatterns in code. Its key element is the identification approach, which varies by tool. In every case, however, it receives as input the characteristics of poor-quality code and software metrics generated from source or compiled code.
The antipattern detection technology in DECOR uses a four-step algorithm. At the first step, experts analyze the subject domain and identify the key concepts on which the classification and taxonomy of all characteristics of poor-quality code are based. The second step is the specification of these characteristics using a domain-specific language (DSL). The key concepts are formalized as rule cards, where a card is a set of rules describing a particular antipattern. The DSL makes it possible to define the links, properties, and internal structure of the antipattern using metrics. At the third step, algorithms for identifying code issues are generated automatically from the rule cards using the DSL and the source code parser. At the last step, the generated algorithms are automatically applied to system models and identify suspicious classes.
DECOR is designed around the features of the Java language, its syntax, and the corresponding coding conventions. Therefore, in the study [9], all metrics are calculated for this programming language. In addition, the vocabulary and taxonomy were developed for only four antipatterns, so DECOR can identify only those four. The disadvantage of this approach is its heavy dependence on expert review. The first two steps of the algorithm are not automated, so adding a new antipattern for identification is time-consuming and requires experts.
The approach used in the Paprika tool is a three-step algorithm. The first step is to collect metrics. The input is an APK file and related metadata. The output is a Paprika quality model that includes entities, metrics, and properties. At this step, Paprika generates a model of the mobile app and extracts quality metrics from the input artifact. Paprika builds the model from 6 entities. The entities are described by 17 properties, which are attached to them as attributes. Properties and entities are linked by connections. Paprika also extracts metrics for each entity. There are currently 34 metrics available. The method uses 2 types of metrics: object-oriented and Android-specific. Unlike properties, metrics require calculation or processing of the byte-code. The quality model is built from the described parameters.
The second step is the conversion of the quality model into a graph model. The input is the model obtained at the previous step. The output is a graph model stored in the database. Because graph databases do not depend on a rigid schema, the graph model is almost the same as the model from the first step. All entities are represented as vertices of a graph. Attributes and metrics are properties of vertices. The connections between the entities are represented by unidirectional edges.
The last step is the identification of antipatterns. The input is a graph database containing the quality model. The output is the set of vertices, and therefore entities, that contain antipatterns. Once the model is loaded and indexed by the graph database, the database query language can be used to identify common characteristics of poor-quality code.
Because Paprika analyzes byte-code, it can only analyze applications written in Java. In addition, byte-code often does not allow metrics to be estimated accurately, as the author himself stated in his research [5].
Formulation of the problem. The analysis of works [5][6][7][8][9][10] reveals two major problems with identifying poor-quality code in Android projects: the lack of methods and tools for Kotlin and the presence of unexplored programming antipatterns. Kotlin [13], which is rapidly developing and gaining popularity, has been named by Google as the recommended development language for Android. Considering also that only four characteristics of poor-quality code remain unexplored, it can be concluded that the first problem is more critical. In addition, it should be noted that more than 8% [13] of all developers use Kotlin all the time. Thus, research on identifying problem code in Android projects written in Kotlin is relevant.
According to the analysis [5][6][7][8][9], two main methods for identifying poor-quality code were determined. Because the DECOR approach is not fully automated, the authors do not rely on it in this research. The approach used in Paprika is more promising, as its author covered not only general characteristics but also Android-specific ones. However, the results of the metric calculation, and therefore the identification, will be more accurate when the source code is analyzed. Analyzing the source code also makes it possible to expand the list of metrics used, which increases the number of identifiable antipatterns. Therefore, it is proposed to improve the Paprika technique by analyzing the source code and to use it to identify Kotlin code flaws. Thus, the purpose of the work is to investigate and improve the method of identifying poor-quality code for Android projects written in Kotlin.
Вісник Національного технічного університету «ХПІ». Серія: Системний аналіз, управління та інформаційні технології, № 1 (3) 2020
Low-quality code identification technology. The identification technology developed by the authors is built on a four-step approach. Its generalized scheme is shown in fig. 3. The first step is the syntax analysis of the project source code. In contrast to Paprika [5], where byte-code is analyzed, it was decided to work with source code. This has the following advantages: no information is lost, the results are more accurate, ready-made source code metrics can be used, and the original names of components are preserved. In addition, byte-code is stored in archives, which imposes additional restrictions on working with the system. The input can be either a path to a project directory or a link to a web hosting service where the project is stored (GitHub, GitLab, Bitbucket, etc.). It is suggested to use the KASTree library for syntax analysis. It makes it possible to represent the source code as an abstract syntax tree (AST). At the second step, the quality model is constructed from the obtained AST. The syntax tree provides information about classes, methods, variables, and the relationships between them. Unlike the Paprika quality model, this approach also includes the object-oriented Chidamber and Kemerer metrics [14]. At the third step, the quality model is converted so that it can be saved in a graph DB. At the last step, identification is performed by running prepared queries against the DB. Each query searches for a particular antipattern. After that, a report is created that lists the code flaws found and their locations. The content of each step is briefly described in the next sections.
Syntax analysis is the process of converting source code into a structured representation. It is needed to build an AST, from which the metrics required for the quality model can be obtained quickly. An AST is a tree representation of the abstract syntactic structure of source code written in a programming language.
An AST is a tree data structure, that is, a finite set $T$ with the following properties:
• There is only one root of the tree: the project directory;
• Other nodes are syntactic constructions found in the source code;
• All non-root nodes are distributed among disjoint sets $T_1, T_2, \dots, T_m$, and each set is a subtree;
where $m$ is the number of syntax constructions.
An AST can be obtained from a Kotlin project using the KASTree library. However, the output of this tool is a list of syntax constructions that is not in a unified data format.
To keep the system flexible and extensible, it is suggested to use the Adapter design pattern, which is shown in fig. 4. The adapter converts the KASTree library output to a common JSON data format. If the syntax parsing tool is replaced, the logic that uses the AST at the next stage does not need to change.
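The role of the adapter can be illustrated with a minimal sketch. It is written in Python for brevity and does not use the real KASTree API: `FakeKotlinParser`, its tuple output, `AstProvider`, and the JSON layout are all hypothetical names assumed for illustration only.

```python
import json
from abc import ABC, abstractmethod


class AstProvider(ABC):
    """Target interface the quality-model stage depends on (hypothetical)."""

    @abstractmethod
    def to_json(self, source: str) -> str:
        """Return the AST of `source` as a JSON string."""


class FakeKotlinParser:
    """Stand-in for a third-party parser; NOT the real KASTree API."""

    def parse(self, source: str) -> list:
        # Pretend output: a flat list of (kind, name, parent) tuples.
        return [("class", "User", None), ("fun", "getName", "User")]


class ParserAdapter(AstProvider):
    """Adapter: wraps the parser and exposes the common JSON format."""

    def __init__(self, parser: FakeKotlinParser):
        self._parser = parser

    def to_json(self, source: str) -> str:
        nodes = [
            {"kind": kind, "name": name, "parent": parent}
            for kind, name, parent in self._parser.parse(source)
        ]
        return json.dumps({"nodes": nodes})


# The next stage only ever sees AstProvider, so the parser can be swapped.
provider: AstProvider = ParserAdapter(FakeKotlinParser())
ast_json = provider.to_json("class User { fun getName() = name }")
```

Under these assumptions, replacing the parser means writing one new adapter class, while the code that consumes `ast_json` stays unchanged.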

Quality model generation.
This model is based on the quality model used in the Paprika tool [5]. It includes 6 entities: Package, Class, Method, Attribute, Variable and Argument. Each entity is described by attributes, such as full name, access modifier, type and others. The model provides 7 types of relationships between entities: Package Has Class, Class Has Method, Class Has Attribute, Method Has Argument, Inherits, Calls, Uses. Each relationship exists only between two fixed types of entities. For example, the relation Inherits can exist only between two entities of type Class. In addition to attributes, entities have source code metrics. The model provides 34 metrics. They are divided into object-oriented (OO) and Android-specific metrics. OO metrics consist of simple and computed values. Simple metrics can be obtained directly from the AST, e.g. the number of methods in a class, the number of parameters in a method and so on. Computed metrics, on the other hand, require additional calculations. The authors propose to supplement this data with the Chidamber and Kemerer metrics. Their usage expands the range of code flaws that can be identified and improves the accuracy of the results. These metrics are briefly described below.
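The constraint that each relationship connects only two fixed entity types can be sketched as a small data structure. This is an illustrative Python sketch, not the Paprika implementation. The endpoint types for Calls and Uses are assumptions, since the text does not state them, and all class and field names are hypothetical.

```python
from dataclasses import dataclass, field

# (source type, target type) allowed for each relationship.
ALLOWED = {
    "PACKAGE_HAS_CLASS": ("Package", "Class"),
    "CLASS_HAS_METHOD": ("Class", "Method"),
    "CLASS_HAS_ATTRIBUTE": ("Class", "Attribute"),
    "METHOD_HAS_ARGUMENT": ("Method", "Argument"),
    "INHERITS": ("Class", "Class"),
    "CALLS": ("Method", "Method"),    # assumption: not stated in the text
    "USES": ("Method", "Variable"),   # assumption: not stated in the text
}


@dataclass
class Entity:
    kind: str                       # one of the 6 entity types
    name: str
    metrics: dict = field(default_factory=dict)


@dataclass
class QualityModel:
    entities: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def add(self, entity: Entity) -> Entity:
        self.entities.append(entity)
        return entity

    def relate(self, rel: str, src: Entity, dst: Entity) -> None:
        # Reject relationships between entity types the model does not allow.
        if ALLOWED[rel] != (src.kind, dst.kind):
            raise ValueError(f"{rel} not allowed: {src.kind} -> {dst.kind}")
        self.relations.append((rel, src.name, dst.name))


model = QualityModel()
user = model.add(Entity("Class", "User", {"number_of_methods": 1}))
admin = model.add(Entity("Class", "Admin"))
get_name = model.add(Entity("Method", "getName"))
model.relate("INHERITS", admin, user)
model.relate("CLASS_HAS_METHOD", user, get_name)
```

Calling `model.relate("INHERITS", user, get_name)` raises `ValueError`, which mirrors the rule that Inherits exists only between two Class entities.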
Weighted Methods per Class (WMC). Consider a class $C$ with a set of methods $M_1, M_2, \dots, M_n$ that are defined in this class. Let $c_1, c_2, \dots, c_n$ be the cyclomatic complexities of these methods. Then:
$$WMC = \sum_{i=1}^{n} c_i.$$
Cyclomatic complexity is computed using the control flow graph of the program: the nodes of the graph correspond to indivisible groups of commands of a program, and a directed edge connects two nodes if the second command might be executed immediately after the first command. Then:
$$c_i = E_i - N_i + 2,$$
where $E_i$ is the number of edges in the graph for the $i$-th method; $N_i$ is the number of nodes in the graph for the $i$-th method.
Depth of Inheritance (DIT). This metric is used to determine the location of a class in the inheritance hierarchy. DIT shows how many class ancestors can potentially affect the class. DIT is defined as the maximum number of ancestor classes of a class. To find this metric, the inheritance tree is traversed recursively until the first ancestor class is reached. The number of visited classes is the DIT.
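A minimal sketch of these two metrics follows, assuming the usual McCabe formula for a single connected control flow graph (edges minus nodes plus two) and a single-parent class hierarchy. The function names, the input encoding, and the example data are hypothetical. Whether an implicit root such as Kotlin's `Any` is counted is a design choice, and it is left out here.

```python
def cyclomatic_complexity(edges: int, nodes: int) -> int:
    """McCabe complexity of one method's control flow graph."""
    return edges - nodes + 2


def wmc(method_graphs: list) -> int:
    """Sum the complexity over (edges, nodes) pairs, one pair per method."""
    return sum(cyclomatic_complexity(e, n) for e, n in method_graphs)


def dit(cls: str, parent_of: dict) -> int:
    """Count ancestors by following parent links up to the topmost class."""
    depth = 0
    while cls in parent_of:
        cls = parent_of[cls]
        depth += 1
    return depth


# Hypothetical class with two methods: a straight-line method
# (0 edges, 1 node) and a method with one if/else
# (4 edges, 4 nodes: condition, two branches, exit).
graphs = [(0, 1), (4, 4)]
wmc_value = wmc(graphs)

# Hypothetical hierarchy: Admin inherits User, User inherits Person.
parents = {"Admin": "User", "User": "Person"}
dit_value = dit("Admin", parents)
```

Here the straight-line method contributes 1 and the branching method contributes 2, and `Admin` has two ancestors.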
Coupling Between Objects (CBO) for a class is a count of the number of other classes to which it is coupled. Coupling between two classes is said to occur when one class uses methods or variables of another class. CBO is measured by counting the number of distinct non-inheritance-related class hierarchies on which a class depends. Let $C$ be a class with a set of methods $M_1, M_2, \dots, M_n$ and a set of variables $V_1, V_2, \dots, V_m$ that are used in this class, where $M_i \notin C$, $V_j \notin C$, $i = \overline{1,n}$, $j = \overline{1,m}$. Then CBO is the number of distinct classes in which these methods and variables are declared:
$$CBO = \left|\{\, D : M_i \in D \ \text{or}\ V_j \in D,\ i = \overline{1,n},\ j = \overline{1,m} \,\}\right|.$$
Response for a Class (RFC) is the count of the set of all methods that can be invoked in response to a message to an object of the class or by some method in the class. This includes all methods accessible within the class hierarchy. RFC is defined as follows:
$$RFC = |RS|, \qquad RS = \{M\} \cup \bigcup_{i} \{R_i\},$$
where $\{R_i\}$ is the set of methods called by the $i$-th method; $\{M\}$ is the set of methods that belong to the class.
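Both metrics reduce to counting distinct elements of a set, which the following sketch illustrates. It follows the prose definitions above. The input encoding (pairs of owner class and member name, and a map from each method to the methods it calls) and the example class `Cart` are assumptions made for illustration.

```python
def cbo(own_class: str, used_members: list) -> int:
    """Count distinct other classes whose methods or variables are used.

    used_members holds (owner_class, member_name) pairs referenced
    from the body of own_class.
    """
    return len({owner for owner, _ in used_members if owner != own_class})


def rfc(own_methods: set, calls: dict) -> int:
    """Size of the response set: own methods plus every method they call."""
    response_set = set(own_methods)
    for method in own_methods:
        response_set |= calls.get(method, set())
    return len(response_set)


# Hypothetical class Cart that touches Item and Logger.
used = [
    ("Item", "price"),
    ("Item", "name"),
    ("Logger", "log"),
    ("Cart", "total"),   # own member, does not count as coupling
]
cbo_value = cbo("Cart", used)

own = {"total", "add"}
calls = {"total": {"Item.price", "add"}, "add": {"Logger.log"}}
rfc_value = rfc(own, calls)
```

In this example `Cart` is coupled to two classes, and its response set contains its two own methods plus two external ones; the internal call from `total` to `add` is not counted twice.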
Lack of Cohesion in Methods (LCOM) measures the extent to which methods reference the class's instance data. Consider a class $C$ with a set of methods $M_1, M_2, \dots, M_n$. Let $\{I_i\}$ be the set of instance variables used by method $M_i$. There are $n$ such sets $\{I_1\}, \{I_2\}, \dots, \{I_n\}$. Let $P = \{(I_i, I_j) \mid I_i \cap I_j = \emptyset\}$ and $Q = \{(I_i, I_j) \mid I_i \cap I_j \neq \emptyset\}$. If all $n$ sets are empty, then let $P$ also be empty. Then:
$$LCOM = \begin{cases} |P| - |Q|, & \text{if } |P| > |Q|, \\ 0, & \text{otherwise.} \end{cases}$$
Catolino described [15] rules for determining code flaws using the described metrics. As shown below, it is possible to identify code smells by converting these rules into database queries.
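The LCOM definition can be checked on small inputs with the sketch below. It assumes that each unordered pair of methods is counted once, which is the usual reading of the Chidamber and Kemerer definition. The function name and the example variable sets are hypothetical.

```python
from itertools import combinations


def lcom(used_vars: list) -> int:
    """LCOM from a list of sets, one set of instance variables per method."""
    # Special case from the definition: if every set is empty, P is empty.
    if not any(used_vars):
        return 0
    p = q = 0
    for a, b in combinations(used_vars, 2):
        if a & b:
            q += 1   # the two methods share at least one instance variable
        else:
            p += 1   # the two methods share nothing
    return max(p - q, 0)


# Three methods: the first two share "a", the third is unrelated.
lcom_mixed = lcom([{"a"}, {"a", "b"}, {"c"}])

# Fully cohesive class: every method uses the same field.
lcom_cohesive = lcom([{"x"}, {"x"}, {"x"}])
```

In the first example two pairs share nothing and one pair shares a variable, so the result is 1. In the second every pair shares `x`, so the result is 0.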
Transformation into a graph representation. For convenient and efficient work, the model must be represented as a graph. If the entities of the quality model are considered as vertices of the graph, the relationships between entities as edges, and the attributes and metrics of entities as properties of vertices, then the quality model can be converted to graph form. This graph is stored using a graph DB. This solution is flexible and efficient, because this way of storing data does not depend on a rigid schema. Therefore, the converted model (fig. 4) is the same as the one described in the previous section. In addition, graph repositories show high performance with datasets of up to $2^{35}$ nodes and relationships. This makes it possible to identify antipatterns even in large systems.
God class is a class that contains a large number of fields and methods. It is responsible for different logic, and its attributes are related to different processes, which implies strong coupling with other classes. Such classes are difficult to maintain and increase the complexity of software modification. The author of [15] proposes to use metrics such as WMC, LCOM, number of methods (NM) and number of fields (NF) to identify God Class cases. If for any class LCOM > 15 and WMC > 9 or NM > 12 and NF > 8, then it is considered a God Class. The graph DB query in Cypher notation for identifying such antipatterns is shown in fig. 5.
In Android, class fields should be accessed directly for performance reasons. Using internal getters and setters turns the access into a virtual call, making the operation three times slower than direct access. Internal getters and setters can be identified using the graph model. The query for this antipattern is shown in fig. 6. This query looks for two methods of one class where one calls the other and the called method is marked as a getter or setter.
Conclusions.
This work describes a technology for identifying poor-quality code in Android projects written in Kotlin. It is based on the work of Hecht [5] and is one way to improve the Paprika tool and adapt it to the Kotlin programming language. The proposed approach uses source code instead of byte-code and complements the object-oriented metrics offered by Hecht. This will increase the number of identifiable antipatterns and, drawing on the works [14,15], improve the accuracy of the results. An implementation of the described technology will make it possible to identify both object-oriented and Android-specific characteristics of poor-quality code. As a future extension of the study, the authors suggest using the proposed approach to develop software for identifying antipatterns in Kotlin web-based applications, or adapting it to the Swift language, which is used in developing projects for the iOS platform.