Conceptual Fit: Information Flow vs. Link Density
Infomap’s information-flow model aligns well with vertical slices. In a well-architected codebase, a developer working on a feature (e.g. authentication) tends to stay within that feature’s files; imports flow mostly inside the slice, with rare jumps outside. Infomap is designed to capture exactly this scenario: it treats the network as a “map” of a random walker and finds communities where the walker tends to get “stuck”. The algorithm optimizes the Map Equation by minimizing the description length of a random walk, grouping nodes into communities such that movement within a community is frequent and movement between communities is infrequent. In other words, Infomap finds regions in the graph “among which information flows quickly and easily”, which is what we expect for cohesive feature modules. A “vertical slice” in code should correspond to a region of intense internal information flow (many imports within the feature) and minimal flow crossing its boundary, and that is what Infomap’s objective rewards.
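To make the description-length idea concrete, here is a minimal sketch of the two-level map equation for an undirected, weighted graph, written in plain Python rather than calling Infomap itself. The file names and the helper names (`plogp`, `map_equation`) are illustrative assumptions; the formula is the standard expanded form, using node strength as the visit rate.

```python
from collections import defaultdict
from math import log2


def plogp(x):
    """Return x * log2(x), with the convention 0 * log2(0) = 0."""
    return x * log2(x) if x > 0 else 0.0


def map_equation(edges, module_of):
    """Two-level description length (bits per step) of an undirected graph.

    edges: list of (u, v, weight); module_of: dict node -> module id.
    """
    strength = defaultdict(float)     # weighted degree of each node
    exit_weight = defaultdict(float)  # weight of edges crossing each module boundary
    total = 0.0
    for u, v, w in edges:
        strength[u] += w
        strength[v] += w
        total += 2 * w
        if module_of[u] != module_of[v]:
            exit_weight[module_of[u]] += w
            exit_weight[module_of[v]] += w

    # On an undirected graph the walker's visit rate is proportional to strength.
    p_node = {n: s / total for n, s in strength.items()}
    p_module = defaultdict(float)
    for n, p in p_node.items():
        p_module[module_of[n]] += p
    q_module = {m: exit_weight[m] / total for m in p_module}
    q_total = sum(q_module.values())

    return (plogp(q_total)
            - 2 * sum(plogp(q) for q in q_module.values())
            - sum(plogp(p) for p in p_node.values())
            + sum(plogp(q_module[m] + p_module[m]) for m in p_module))


# Hypothetical example: two three-file features joined by a single import.
auth = ["auth/login.go", "auth/session.go", "auth/token.go"]
billing = ["billing/invoice.go", "billing/tax.go", "billing/pay.go"]
edges = [(auth[0], auth[1], 1.0), (auth[1], auth[2], 1.0), (auth[0], auth[2], 1.0),
         (billing[0], billing[1], 1.0), (billing[1], billing[2], 1.0),
         (billing[0], billing[2], 1.0), (auth[2], billing[0], 1.0)]

one_slice = {f: 0 for f in auth + billing}
two_slices = {**{f: 0 for f in auth}, **{f: 1 for f in billing}}
print(map_equation(edges, two_slices) < map_equation(edges, one_slice))  # True
```

Splitting along the single cross-feature import gives the shorter description, which is the sense in which the walker gets “stuck” inside each slice.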
Modularity-based methods (Louvain/Leiden) optimize a different criterion. Louvain and Leiden maximize modularity, favoring partitions with higher-than-random internal edge density. While this also tends to group connected files, it doesn’t explicitly model the flow of execution or information. In practice, modularity algorithms can suffer from the well-known resolution limit: they may fail to separate small but meaningful communities in a larger network. Distinct feature slices could therefore get merged if the graph is large or if a slice falls below the algorithm’s resolution scale. Infomap, using information flow, is not bound by a single resolution parameter and can detect smaller communities if they trap the random walker. This conceptual difference suggests Infomap will find the “vertical slice” boundaries (where a random walker would seldom leave) more naturally than modularity methods, which might overlook or merge some slices due to resolution biases. In summary, minimizing description length (Infomap) is a closer proxy for identifying feature boundaries than maximizing edge density (modularity), because code modules are about contained information flow rather than just edge count.
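For contrast, the modularity objective can be sketched in a few lines. This is a plain-Python illustration of the usual weighted modularity with a resolution factor, not the Louvain or Leiden search itself; the function name and toy graph are assumptions made for the example.

```python
from collections import defaultdict


def modularity(edges, community_of, resolution=1.0):
    """Weighted modularity of an undirected graph for a given partition.

    edges: list of (u, v, weight); community_of: dict node -> community id.
    """
    total = sum(w for _, _, w in edges)  # total edge weight W
    internal = defaultdict(float)        # weight of edges inside each community
    degree = defaultdict(float)          # summed node strength per community
    for u, v, w in edges:
        degree[community_of[u]] += w
        degree[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    return sum(internal[c] / total - resolution * (degree[c] / (2 * total)) ** 2
               for c in degree)


# Two triangles joined by one edge (nodes are stand-ins for files).
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
         ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0), ("c", "d", 1.0)]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {n: 0 for n in "abcdef"}

# At the default resolution the split scores higher; at a low resolution the
# merged partition wins, which is the scale dependence discussed above.
print(modularity(edges, split) > modularity(edges, merged))            # True
print(modularity(edges, split, 0.1) < modularity(edges, merged, 0.1))  # True
```

The same graph flips between “two slices” and “one slice” purely as a function of the resolution value, which is why modularity results need that parameter chosen by hand.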
Performance on Small Graphs (10–50 Nodes)
Infomap works effectively on small, dense graphs and is stable. There is no minimum size requirement for Infomap: it can cluster a graph of tens of nodes just as well as hundreds or thousands. Infomap has been used on networks ranging from tiny to very large, and it remains effective at small scales. Unlike algorithms that need many nodes before statistical communities become clear, Infomap will partition even a 10-node graph if structure exists. One advantage noted in practical comparisons is that Infomap yields consistent, robust results across multiple runs. Its search is a stochastic heuristic (iterative node moves similar to Louvain’s approach), but on small graphs with clear structure the trials usually converge on the same minimum-description-length partition, so results don’t change arbitrarily between executions. In contrast, algorithms like Label Propagation can vary wildly between runs, and even Louvain/Leiden have some nondeterminism due to random initialization or node-order effects. Infomap is frequently praised for producing “stable and reliable outcomes” based on its information-theoretic approach. In practice, this means running Infomap on the same code dependency graph is likely to give the same community split, and with a fixed random seed the result is repeatable. This stability is important for our use case: we want the same codebase to consistently yield the same slices.
Louvain/Leiden on small graphs can misbehave without tuning. As we observed, a naive Louvain implementation returned each file as its own community in a 13-node project. This is an anomaly, since Louvain should have merged at least some nodes, but it can happen when the algorithm gets stuck in a local optimum or when the modularity gain from merging small dense groups is too small at the chosen resolution. Modularity methods have a known tendency to either merge small communities or leave a fragmented partition if the graph is small and edge weights are uniform. There is a resolution parameter that can be adjusted (e.g. NetworkX’s implementation allows setting resolution < 1.0 to favor larger clusters), but this is essentially a manual tweak. Infomap does not require such a parameter; it looks for the partition that best compresses the random-walk description at the natural scale of the data. This makes Infomap well-suited for graphs in the tens-of-nodes range, where it is less likely to over-aggregate or split arbitrarily just because of graph size.
In summary, Infomap performs reliably on 10–50 node graphs and yields stable community assignments (same input → same output), given its deterministic nature with a fixed seed. The computational cost is negligible at this scale (well under a second), so one can even run multiple trials to ensure the optimal solution is found. There is no evidence that Infomap needs a larger network to “kick in” – even a handful of nodes can be meaningfully clustered if the link pattern suggests communities.
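One way to check the “same input → same output” property in a pipeline is to compare partitions from repeated runs while ignoring the arbitrary module ids each run assigns. The sketch below assumes each run is available as a dict from file name to module id; the helper names and sample data are hypothetical.

```python
def canonical(partition):
    """Turn a node -> label mapping into a label-independent set of groups."""
    groups = {}
    for node, label in partition.items():
        groups.setdefault(label, set()).add(node)
    return frozenset(frozenset(g) for g in groups.values())


def is_stable(runs):
    """True when every run produced the same grouping, ignoring label ids."""
    return len({canonical(r) for r in runs}) <= 1


# Two runs that agree up to relabelling, and a third that moves one file.
run_a = {"auth/login.go": 1, "auth/session.go": 1, "main.go": 2}
run_b = {"auth/login.go": 7, "auth/session.go": 7, "main.go": 3}
run_c = {"auth/login.go": 1, "auth/session.go": 2, "main.go": 2}

print(is_stable([run_a, run_b]))         # True
print(is_stable([run_a, run_b, run_c]))  # False
```

A check like this could run in CI over several trials or seeds, so any instability in the detected slices is noticed rather than assumed away.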
Handling Dense Subgraphs (Cliques and Tight Coupling)
Dense subgraphs (cliques) naturally form Infomap communities. If files within a feature are all mutually connected (e.g. via directory edges or mutual imports), they form a clique or near-clique. Infomap will almost certainly keep such a tightly-knit group together in one module, because a random walker would circulate among those files and rarely escape. This is desirable: a set of files that are fully interdependent should be identified as one feature slice. Infomap doesn’t arbitrarily split fully connected groups – doing so would increase the description length, since the walker would frequently cross the artificial boundary. Instead, it prefers to merge dense regions into a single community, reflecting their high internal cohesion. This means Infomap handles high internal coupling well: it won’t “over-split” a feature that has a lot of internal calls. In our codebase scenario, files under the same directory with many import links (explicit or implicit) are likely to end up in one community.
Potential over-merging is mitigated by Infomap’s flow optimization. A concern with very dense graphs is whether Infomap might merge distinct sub-communities into one because there are no obvious sparse cuts. Infomap will indeed merge areas if the random walk flows freely between them. For example, if two feature modules are heavily interconnected (lots of imports both ways), Infomap might identify them as one larger community, since a walker doesn’t get “trapped” in just one. This could be seen as over-merging from an architectural perspective (two features identified as one). Notably, a study comparing Infomap and Leiden found that Infomap tends to produce fewer, larger communities on software graphs, whereas Leiden produced more, smaller communities. Infomap aggregated many classes into the same component when structural connectivity was high, while Leiden split them. In that study, Infomap gave 3 communities vs Leiden’s 7 on one project and 6 vs 12 on another, indicating that Infomap was more willing to keep a dense cluster intact rather than subdivide it.
However, whether this is “over-merging” or correctly identifying a bigger module depends on the ground truth. Infomap’s bias is to merge as long as internal flow is strong and there isn’t a clear flow boundary. This is actually a strength for vertical slices: if two groups of files are so interdependent that a random walk sees them as one region, it might mean they are logically one feature (or that the architecture is tangled). Leiden (modularity) might force a split based on edge densities, but that can separate files that are conceptually one unit. One empirical finding: Infomap communities often had higher internal semantic cohesion than Leiden’s communities, suggesting Infomap kept strongly related code together. The trade-off was that Infomap sometimes yielded one component so large it spanned multiple semantic topics, lowering overall separation. For instance, Infomap on one project grouped many classes into one big cluster (perhaps a central framework), whereas Leiden split them; Leiden’s clusters were better separated (less overlap in content) in 7 out of 9 cases. This indicates that in extremely dense regions, Infomap might group files as one community even if that group covers diverse functionality, essentially because the code itself is highly entangled.
Bottom line: Infomap will treat fully-connected or strongly-connected file sets as single communities, which is usually correct for well-designed slices. It generally will not split a clique (avoiding under-merging issues). But if your entire codebase is one hairball of dense connections, Infomap might output one giant community, whereas Leiden might arbitrarily cut it into smaller pieces. In practice, if the codebase is moderately modular (several dense subgraphs loosely connected), Infomap should correctly identify those subgraphs as the communities. If we encounter Infomap merging what we expect to be two separate features, that’s a sign those features have a lot of mutual coupling (perhaps needing architectural refactoring). We could then either accept that they function as one module or consider using a method with a resolution parameter to force a split if truly desired (Leiden with higher resolution, or manually separating by known boundaries). But under normal circumstances, Infomap handles dense subgraphs by keeping them intact, aligning with the intuitive notion of a feature module.
Hierarchical Structure and the Two-Level Approach
Infomap supports hierarchical community detection, but one level is often sufficient for features. By default, Infomap will partition a network hierarchically (multi-level) if it finds sub-structures that improve compression. This means it can identify communities within communities: for example, it might cluster files into small groups and then cluster those groups into a larger module. In the context of code, a multi-level result could reveal layers; within an “engine” module, for instance, Infomap might further detect sub-communities corresponding to subsystems. However, in a typical vertical slice architecture (especially with only 10–50 files), we expect a flat modular structure (features composed of files, with perhaps one intermediate grouping by directory which we already model via edges). We likely don’t need multiple levels of communities.
The --two-level parameter in Infomap forces the algorithm to find a single-level partition (no nested modules). This is recommended for our use case because we are explicitly looking for top-level feature slices as the output. Using two-level mode will yield a flat list of communities (each community = a slice of files), which is easy to interpret as feature modules. This avoids any confusion where Infomap might otherwise report, say, Module A containing sub-modules A1 and A2 – which might be more detail than we want. For small graphs, Infomap’s multi-level ability might not even trigger unless there is a clear reason. But enabling two-level ensures that, for example, if there is a slight hierarchical structure (say a utility subcomponent inside a larger component), the algorithm doesn’t separate it into nested output.
In practice, Infomap’s hierarchical detection could be useful if the code has a layered architecture. If our project had, for instance, “feature groups” that themselves cluster into higher domains, multi-level Infomap could potentially identify that. (An anecdotal example: one study applied multi-level Infomap to its own tool’s codebase and found it divided into three high-level groups, with further nesting inside.) But for clearly delineated vertical slices, those top slices are what we care about. Therefore, using two-level Infomap is appropriate and it tends to work well: it finds the best flat partition without hunting for deeper splits.
Hierarchical output in Infomap has been tested on codebases. Researchers have compared a multi-level Infomap clustering to a Louvain clustering on a codebase and found them broadly similar at the top level. This suggests that Infomap’s first-level communities correspond to meaningful components. Indeed, in a case study on a game’s source code (Super Mario), Infomap correctly captured all the top-level components (e.g. engine, app, utilities) with only minor misplacements of some subcomponents. This gives confidence that the primary level of Infomap’s partition aligns with actual architectural features. We can thus use two-level clustering to directly obtain those feature groupings, treating each Infomap community as one candidate vertical slice.
To conclude, Infomap can detect hierarchical structure, but we will likely constrain it to one level for vertical slices. The two-level setting is effective for our scale and needs, and it simplifies interpreting the results (each community is a feature). If future needs arise to see sub-slices or group multiple slices, we could run multi-level and examine the nesting, but for now the focus is on the top-level partition.
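If a multi-level run is ever used for inspection, its nested result can still be collapsed to top-level slices. The sketch below assumes Infomap’s `.tree` text layout of one leaf per line as `path flow "name" node_id`, with `:`-separated module paths and `#` comment lines; the sample content and function name are made up for illustration and should be checked against the output of the installed Infomap version.

```python
import shlex
from collections import defaultdict

# Illustrative .tree-style text (not real tool output): two top-level modules,
# the first of which has nested sub-modules.
SAMPLE_TREE = """\
# path flow name node_id
1:1:1 0.25 "engine/core.go" 1
1:1:2 0.20 "engine/loop.go" 2
1:2:1 0.15 "engine/util/math.go" 3
2:1 0.25 "app/main.go" 4
2:2 0.15 "app/routes.go" 5
"""


def top_level_modules(tree_text):
    """Group leaf names by the first component of their module path."""
    modules = defaultdict(list)
    for line in tree_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        # shlex handles the quoted name, which may contain spaces.
        path, _flow, name, _node_id = shlex.split(line)[:4]
        modules[path.split(":")[0]].append(name)
    return dict(modules)


for module, files in top_level_modules(SAMPLE_TREE).items():
    print(module, files)
```

Keeping only the first path component reproduces the flat view that the two-level setting gives directly, while the full paths remain available for examining sub-slices later.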
Infomap vs. Leiden on Software Dependency Graphs
Comparative studies show mixed results, with Infomap often more cohesive and Leiden more separated. Both Infomap and Leiden (a refined Louvain algorithm) have been applied to software dependency graphs (e.g., class or file dependencies) in academic research. One study (Palomba et al. 2021) compared Infomap and Leiden for identifying software components from Java projects. They found that Leiden consistently produced a larger number of smaller communities, while Infomap produced fewer, larger communities. For example, in one system Leiden found 12 communities where Infomap found 6; in another Leiden found 26 vs Infomap’s 16.
The quality of these clusters was evaluated using semantic metrics (how coherent the code in each cluster was, and how well-separated different clusters were). Infomap’s clusters tended to be more internally cohesive (higher similarity of code within the same community in 5 out of 9 cases), meaning Infomap often put very related classes together. However, Leiden’s clusters were more separated from each other: in 7 of 9 cases, Leiden’s communities had lower similarity to one another than Infomap’s communities did. In other words, Leiden yielded more distinct boundaries. This was reflected in a combined metric (silhouette score) where Leiden outperformed Infomap in 8 of 9 trials. The authors noted that Infomap sometimes created one overly large cluster that spanned multiple functionality areas (harming inter-cluster separation). Leiden avoided that by splitting, at the cost of somewhat less cohesion in some clusters. Their conclusion was: “Leiden extracts less cohesive, but better separated (and better isolated) components… Infomap creates more cohesive, slightly overlapping clusters that are more likely to depend on other similar components.”
Translating this to our context: Leiden might identify finer-grained modules with very few cross-links, whereas Infomap might lump some closely connected features together, resulting in larger chunks that sometimes share conceptual overlap. Which is “better” can depend on ground truth. If the goal is strictly to minimize inter-module coupling, Leiden’s extra splitting could be useful. But if the goal is to maximize internal coherence (all related code together), Infomap’s approach might be preferable.
Domain expert feedback favors Infomap’s granularity in practice. In a real-world scenario of refactoring a monolith into microservices, a developer compared Louvain vs Infomap outputs. Infomap produced many more, smaller clusters (high granularity), while Louvain produced fewer, larger clusters. Interestingly, a domain expert preferred the Infomap result because “it’s easier to merge smaller, well-structured domains into bigger ones than to split a big one that lumps unrelated things”. This aligns with the idea that Infomap, by possibly over-segmenting or giving very granular communities, provides a good starting point: humans can always combine two small slices if they actually belong together, but if an algorithm merges them initially (as Louvain/Leiden might), it’s harder to detect that they should have been separate. The thesis reporting this noted that despite Infomap giving a “high degree of granularity” (lots of tiny services), it was “more accepted by the external evaluator” for the above reason.
However, note that in that microservice clustering case, Infomap gave more clusters than Louvain (70+ microservice candidates vs ~27). This contrasts with the Palomba et al. study, where Infomap gave fewer clusters than Leiden. The difference likely arises from how edges were weighted and the nature of the graph (runtime call graph vs static dependencies). It tells us that Infomap doesn’t always merge everything; it can split aggressively if the random-walk dynamics warrant it. Infomap’s heuristic includes a “submodule movement” step that checks whether a module should be split into smaller ones for better compression, so it will break up a cluster if doing so yields a shorter description length. In the microservice scenario, perhaps certain classes had distinct subdomains that Infomap isolated, whereas modularity might have kept them together for a marginal modularity gain.
Performance and stability: Leiden is generally very fast and guarantees well-connected communities (no broken clusters) with improvements over Louvain, but for our small graphs speed is not an issue. Both algorithms can run nearly instantly on <50 nodes. Regarding stability, Leiden still has some randomness but is typically stable given enough iterations. Infomap, as noted, is stable especially if we set a fixed random seed or run a sufficient number of trials. Neither algorithm should yield drastically different results on repeated runs (Infomap’s variability is low; Leiden’s is low as well, but slight differences can occur due to random initialization, although the Leiden algorithm’s refinement phase mitigates a lot of that).
Summary of comparison: Infomap is conceptually aligned with the “information flow” view of features and often produces very coherent feature groupings. Leiden (modularity) might produce a partition that isolates dependency clusters more stringently (maximally reducing coupling). Empirically, Leiden might split a feature if it sees any benefit to modularity, whereas Infomap might keep it whole if the feature’s internals are strongly connected. If our codebase analysis shows Infomap merging what we suspect are distinct features, we could try Leiden or adjust parameters to see if a higher-resolution cut is needed. But if Infomap is giving the expected 4–6 communities and grouping files intuitively, we likely have a good result already. The evidence does not conclusively say one is strictly better for all software graphs; rather, they have different tendencies. Since our use case prizes consistent, semantically meaningful slices (and we have a mental model that imports = information flow), Infomap is a very strong candidate. It has even been used successfully to recover architecture in codebases, with one report stating that “all top-level components are well captured with Infomap” in a software architecture mining context. Leiden is also a viable option, especially if we encounter scenarios where we want to enforce smaller communities or experiment with a modularity view. But given the conceptual fit and prior successes, leaning on Infomap first is reasonable.
Handling Edge Cases in Code Slice Detection
Let’s address specific edge scenarios and how to manage them:
• Singleton files (no imports): If a file has no import edges (and presumably only the implicit directory edge to itself or none), it will naturally end up isolated. Infomap will place an isolated node in its own community because there is nowhere else for a random walker to go that would shorten the description length. This matches our expectation: e.g. main.go that doesn’t import feature files should be a slice by itself. In our small example, main.go would likely appear alone. This is fine. No special action is needed – just be aware that a truly independent file becomes its own module. In output, we interpret that as “a singleton slice”. (If we desired, we could label it “Main” or “Misc” feature, but that’s a post-processing concern.)
• Shared utility files (imported by all slices): This is a tricky case. Suppose shared/utils.go is used widely across different features. In the dependency graph, utils.go will have edges to many clusters. What will Infomap do? There are a few possibilities:
• It might form its own community, because a random walker entering utils.go has equal probability to go to many other areas, meaning it doesn’t firmly belong inside any single cluster. If treating it as a separate module yields a shorter overall description (because it acts like a hub connecting communities), Infomap could isolate it.
• Or Infomap might attach utils.go to one of the feature communities (perhaps the one it’s most tightly connected to, if any). This could happen if, say, utils.go lives in a particular directory or has slightly more calls from one module than others, tipping the balance to merge it with that module.
Our preference is that utils.go should not form its own slice in the final analysis, since it’s not a feature by itself but a supporting library. How to handle this:
1. Interpretation approach: If Infomap outputs a singleton cluster containing only shared/utils.go, we can decide not to treat it as a “feature”. Instead, we might document it as a shared component. We could even merge it into an existing feature cluster manually (perhaps the cluster that logically owns it, or just leave it aside). The domain knowledge comes into play here – for instance, if all features use utils.go, it could be considered part of a “common” layer or simply an infrastructural utility.
2. Algorithmic approach: We could try to prevent it from being isolated by tweaking the graph. For example, assign a slightly higher weight between utils.go and one particular module (if we know it belongs closer there), or lower the weights of its edges in general. Lowering its edge weights would reduce its influence as a separate hub, possibly causing Infomap to tuck it into some community. However, doing so arbitrarily can distort results, so this must be done carefully or not at all. It might be better to accept the outcome and handle it in post-processing.
In summary, Infomap might isolate a widely shared file. Since we prefer not to highlight a utility as its own feature, we should be prepared to manually merge or label such singleton utility nodes. For instance, if utils.go comes out alone, we can attach it to the “Main” slice or whichever slice seems most appropriate, acknowledging that it’s a shared resource. Importantly, such a utility should not cause the entire algorithm to fail; it’s a single-node cluster which we can deal with after clustering.
• Circular dependencies between slices: If feature A imports feature B, and feature B imports A (directly or indirectly), the two slices are strongly coupled. In an ideal architecture, this shouldn’t happen (features should be more independent), but it’s possible. For community detection, a bidirectional coupling creates a lot of two-way flow between what we hope are separate communities. Infomap is likely to merge two clusters that have heavy two-way interaction, because from a random-walk perspective, A and B effectively function as one region (the walker can shuttle back and forth freely). Louvain/Leiden would similarly have trouble separating them if the inter-feature edge weight is high relative to internal weights – modularity would be higher by grouping them due to the dense cross-edges. So, if slice A and B have circular dependency, the algorithm might output them as a single community (or a larger community containing both sets of files).
If our expectation is that they should be separate slices (maybe the circular dependency is incidental or something we plan to break later), we have a few options:
• Recognize that the algorithm is reflecting reality: A and B are, as it stands, effectively one combined module. We could accept the merged cluster in the short term and note that these two features are entangled.
• If separation is crucial, consider artificially weakening the connection in the graph (e.g. reduce the weight of A–B imports or remove them for clustering purposes, on the assumption that we want to discover slices ignoring a known violation). This is dangerous ground since it manipulates data; it should be done only if we’re certain that the link is an anomaly.
• Alternatively, run Infomap in multi-level mode (that is, without --two-level) and see whether it identifies sub-communities corresponding to A and B inside the merged cluster. Infomap’s hierarchical search can sometimes separate a merged group into sub-modules. If it does, the multi-level output lets us recover A and B as subclusters of a bigger cluster.
In practice, circular dependencies blur the lines between features. Any algorithm may find them inseparable unless the internal cohesion of each feature still outweighs the mutual coupling. If only one or two import links form the cycle (and many more links keep the features internally coherent), Infomap might still separate them because the random walker spends more time inside each than bouncing between them. But if the cycle means a lot of back-and-forth calls, expect a merge. Our approach should be to identify such cases from the results (e.g., if expected features A and B came out as one community) and flag them for architectural review. The tool can then report “Slices A and B appear merged due to strong bi-directional dependencies” rather than trying to force them apart arbitrarily.
• Disconnected components: If parts of the code graph are completely independent (no imports between them at all), any community detection will trivially keep each independent component apart from the others. Infomap will effectively see each connected component as a separate flow domain. This is perfectly fine and actually desirable: it means the code has multiple modules with zero coupling. For example, if the repository contains an entirely separate feature set (maybe a leftover module that doesn’t interact with the rest), it will emerge as its own slice. We should simply recognize that a disconnected set of files is a distinct slice by definition. Nothing special is needed here; just make sure the graph construction doesn’t accidentally drop real dependency edges, so that a disconnection in the graph always reflects a genuine absence of links in the code.
In summary, edge cases require a bit of human oversight:
• Isolated single files will appear as singleton communities (expected for things like main).
• Shared infrastructure files might form singleton communities too – these we might choose to merge or label separately, since they’re support code, not features.
• Strongly coupled features (circular or heavy cross-deps) will likely be lumped together by the algorithm; treat that as a sign of architectural coupling. If needed, adjust weights or note it as a special case.
• Fully separate components will show up as separate clusters, which is the correct behavior.
Our clustering process can incorporate these rules: for example, after Infomap outputs communities, we can scan for any cluster that is a single file with high degree (potential utility) and decide how to handle it, or check if an expected feature got merged and issue a warning. Overall, Infomap provides the raw grouping; we apply domain knowledge to refine edge-case outcomes.
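The post-processing rules above could be sketched roughly as follows. Everything here is an assumption for illustration: the function name, the `hub_degree` threshold, the warning wording, and the idea that expected slices are supplied as a name-to-files mapping. The community assignment is taken as given, from whichever clustering run produced it.

```python
from collections import defaultdict


def review_communities(edges, community_of, expected_slices, hub_degree=3):
    """Return human-readable warnings for singleton and merged-slice cases.

    edges: list of (u, v) pairs; community_of: dict file -> community id;
    expected_slices: dict slice name -> set of files we expect together.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    members = defaultdict(set)
    for node, community in community_of.items():
        members[community].add(node)

    warnings = []
    for community, nodes in sorted(members.items()):
        if len(nodes) == 1:
            (node,) = nodes
            # A well-connected singleton is probably shared infrastructure.
            kind = "shared utility?" if degree[node] >= hub_degree else "isolated file"
            warnings.append(f"singleton {node}: {kind}")
        # Expected slices that share this community were (at least partly) merged.
        merged = sorted(name for name, files in expected_slices.items() if files & nodes)
        if len(merged) > 1:
            warnings.append("slices " + " and ".join(merged)
                            + " appear merged; check for bi-directional dependencies")
    return warnings


# Hypothetical result: auth and billing merged, utils and main left alone.
edges = [("auth/a.go", "auth/b.go"), ("billing/c.go", "billing/d.go"),
         ("auth/a.go", "billing/c.go"), ("billing/d.go", "auth/b.go"),
         ("shared/utils.go", "auth/a.go"), ("shared/utils.go", "auth/b.go"),
         ("shared/utils.go", "billing/c.go")]
community_of = {"auth/a.go": 0, "auth/b.go": 0, "billing/c.go": 0,
                "billing/d.go": 0, "shared/utils.go": 1, "main.go": 2}
expected = {"auth": {"auth/a.go", "auth/b.go"},
            "billing": {"billing/c.go", "billing/d.go"}}

for warning in review_communities(edges, community_of, expected):
    print(warning)
```

The function only reports; deciding whether to merge a utility into a slice or to treat a merged pair as one feature stays with the reviewer, as described above.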
Infomap Hyperparameters and Tuning for This Use Case
Key Infomap parameters:
1. Number of trials (--num-trials): Infomap uses a heuristic search that can get stuck in local minima. Running multiple independent trials improves the chance of finding the global optimum partition. For small graphs, a single run often suffices, but it’s cheap to do (say) 10 trials and take the best result. The algorithm is so fast on 50 nodes that even 100 trials would be instantaneous. In practice, we can set num_trials = 10 (or more) to ensure stability. The benefit is slight given the graph size, but it provides peace of mind that the partition is truly optimal (minimum description length found). Sensitivity-wise, Infomap might find the same partition consistently after a few trials if the structure is clear. If results vary between trials, it means there are nearly equivalent clusterings; increasing trials then helps ensure we pick the best one. So we recommend using multiple trials unless a deterministic mode is on.
2. Two-level vs. multi-level (--two-level): As discussed, use two_level = true to restrict to one-level communities. This prevents Infomap from introducing a hierarchy in the output. For our feature detection, this is appropriate. If two_level is false (multi-level), Infomap might output a nested structure (with .tree files listing submodules). That’s not needed for now. The two-level mode will still internally use submodule movements (to test splitting large modules) but it will report only the final flat grouping. In terms of results, on a small graph, multi-level might end up the same as two-level if it doesn’t find any beneficial subclusters. But explicitly setting two-level guarantees simplicity. This parameter is not particularly sensitive; it’s more about what output format you want. We will set it to avoid confusion.
3. Directed vs. undirected: Infomap can handle directed graphs (it will simulate a directed random walk following edge directions) or undirected ones. Our dependency graph has directed import edges (file A imports B is a one-way relation) and undirected “directory” edges. We have a choice: treat the whole graph as undirected (ignore direction of imports) or as directed (some edges one-way). Using directed mode could, in theory, be more faithful to information flow (since information mainly flows from importer to imported). However, code dependencies are not exactly flows of execution; they are static links. And a developer “stuck” in a feature might navigate in both directions (reading definitions and references). In practice, many architecture recovery efforts simply use undirected graphs for dependency clustering. Undirected Infomap will consider an import as a bidirectional connection for clustering purposes, which is reasonable for grouping files. Directed Infomap could be tried if we suspect asymmetry matters, but it complicates things and might treat nodes with only outgoing (or only incoming) edges strangely. For simplicity, treat the graph as undirected when feeding it to Infomap (e.g., for each import, add an undirected edge or a reciprocal edge). This lets the random walker move both ways along a dependency, better simulating a developer moving through code relationships. Infomap treats links as undirected by default, so we simply do not pass the --directed flag. We should be consistent: include the directory coupling edges as normal undirected edges too.
4. Weighting strategy: By default, we have set all edges (imports and directory co-location) to weight 1.0, so Infomap considers them equally. We might ask whether any tuning of weights is needed. For example, should an actual import be considered a stronger signal of coupling than just “in the same folder”? Possibly: an import definitely means a code reference, whereas sharing a directory is a weaker, implicit relation. If we find Infomap splitting files that are in the same feature folder just because they lack direct imports, we might consider raising directory edge weights relative to imports so that those files stay together. Conversely, if we think directory edges are causing over-clustering (e.g., grouping things just because of folder even if there is no real usage), we could reduce their weight (say 1.0 for import, 0.5 for same-folder). This is a tunable aspect:
• Our initial approach can stick to equal weights (1.0). This already produced the expected communities in our example (since each feature folder became a clique via directory edges).
• If Louvain failed to find clusters with equal weights, it might be because the directory edges created cliques that are all interconnected via a few cross edges, which confuses modularity. Infomap might handle this better. Should we weight imports higher for Infomap? Not necessarily: Infomap cares about path frequency, and a clique with many internal edges will appear strong anyway.
• We can monitor results: if Infomap merges two feature folders into one community, perhaps it’s because there are import edges between them. We might then consider down-weighting those cross-feature import edges (maybe treat them as weight < 1) to see if it splits them. But altering real data could just mask true coupling, so it’s often better not to fiddle unless we intend to bias the outcome knowingly.
In summary, Infomap’s outcome can be adjusted by weights, but we should use that sparingly. For now, equal weighting is fine. The key hyperparameters are those that control the algorithm’s search and output format (trials, two-level) rather than resolution: the standard map equation has no resolution parameter of the kind modularity algorithms expose, and although Infomap offers an optional Markov-time setting that shifts granularity, we do not need it here.
5. Random seed: Infomap (especially via the Python API or command-line) allows setting a random seed for reproducibility (--seed option). We should use this in our pipeline to ensure that we always get the same result for the same input. While Infomap is stable, different random seeds could yield slightly different partitions if there are ties or symmetrical solutions. Setting a fixed seed (e.g., seed=42) will make the algorithm deterministic. This is highly recommended for a production tool – we want slice detection to be repeatable. Additionally, if we run multiple trials, Infomap will vary the seed internally for each trial but we can control the sequence if needed.
6. Multi-threading and speed: Not a concern for 50 nodes, but Infomap can use multiple threads for larger graphs. We can disable or enable OpenMP but it won’t matter here. Just ensure it’s not nondeterministic due to parallelism (setting a seed usually covers this, but if needed, run single-threaded for absolute determinism).
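As a rough illustration of points 3 and 4, the sketch below collapses directed imports into undirected weighted edges and adds same-folder links. The function name `build_undirected_edges`, its parameters, and the choice to sum weights when two files are linked more than once are assumptions made for this example, not part of any existing tool.

```python
from collections import defaultdict
from itertools import combinations
from pathlib import PurePosixPath


def build_undirected_edges(files, imports, import_weight=1.0, dir_weight=1.0):
    """Return {(a, b): weight} with a < b, combining imports and folder links.

    files:   iterable of file paths (nodes).
    imports: iterable of (importer, imported) pairs.
    Weights are summed when the same pair is linked more than once
    (an assumption of this sketch; taking the max is another option).
    """
    edges = defaultdict(float)

    # Drop direction: A->B and B->A land on the same sorted key.
    for src, dst in imports:
        if src != dst:
            edges[tuple(sorted((src, dst)))] += import_weight

    # Link every pair of files that share a parent directory.
    by_dir = defaultdict(list)
    for path in files:
        by_dir[str(PurePosixPath(path).parent)].append(path)
    for members in by_dir.values():
        for a, b in combinations(sorted(members), 2):
            edges[(a, b)] += dir_weight

    return dict(edges)
```

Starting with `import_weight == dir_weight` matches the equal-weight default; changing only `dir_weight` is one way to run the experiments described in point 4 without touching the import data.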
Sensitivity analysis: Overall, Infomap is not highly sensitive to hyperparameters for small graphs – you typically get the “right” communities without much tweaking. The main thing is making sure we’re using two-level vs multi-level appropriately and capturing the graph correctly. The number of trials ensures we’re not stuck in a suboptimal partition (which can happen in larger networks more so than in small ones). If results look off, it’s more likely due to data issues (e.g., missing edges, or an architectural quirk) than Infomap needing parameter tuning.
To summarize recommendations:
• Use two_level=true to get flat communities.
• Run a moderate number of trials (e.g. 10) and set a fixed seed for consistent results.
• Treat the graph as undirected (feed symmetric edges).
• Use default weights unless a specific need arises to tweak them.
• Check the output; if something is undesirable (like a utility alone), that’s addressed by post-processing rather than Infomap parameters.
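The recommendations above could be wired together roughly as follows. This is a sketch that assumes the `infomap` Python package is installed; the helper names `infomap_flags` and `cluster`, and the shape of the `edges` argument (a mapping from file-name pairs to weights), are hypothetical. Files with no edges at all are not passed to Infomap here and would need to be reported as singletons separately.

```python
def infomap_flags(num_trials=10, seed=42):
    """Flag string for a flat, repeatable run (undirected is Infomap's default)."""
    return f"--two-level --num-trials {num_trials} --seed {seed} --silent"


def cluster(edges, num_trials=10, seed=42):
    """Cluster an undirected weighted graph given as {(name_a, name_b): weight}.

    Returns {file_name: module_id}. Requires the third-party `infomap` package.
    """
    from infomap import Infomap  # imported lazily so the helper above has no dependency

    # Infomap works on integer node ids, so map file names to ids and back.
    names = sorted({name for pair in edges for name in pair})
    ids = {name: i for i, name in enumerate(names)}

    im = Infomap(infomap_flags(num_trials, seed))
    for (a, b), weight in edges.items():
        im.add_link(ids[a], ids[b], weight)
    im.run()

    modules = im.get_modules()  # {node_id: module_id}
    return {name: modules[ids[name]] for name in names}
```

Keeping the flags in one function makes it easy to log exactly which settings produced a given set of slices.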
Alternative Algorithms and Domain-Specific Approaches
Beyond Infomap and Leiden/Louvain, are there algorithms tailored for software dependency analysis? Several approaches have been explored over the years:
• Search-based clustering (high cohesion, low coupling): In software engineering research, a classic approach is to formulate clustering as an optimization of custom metrics. For example, the tool Bunch (Mitchell and Mancoridis, late 1990s) uses a genetic algorithm/hill-climbing to maximize a modularization quality (MQ) function. MQ rewards clusters that have many internal connections (high cohesion) and few external connections (low coupling), which is very much in line with what we want. Bunch doesn’t use graph algorithms like Louvain; instead it directly searches the space of partitions to maximize MQ. It was quite successful in research settings and can find good decompositions. However, it is computationally expensive (searching partition space) and can be non-deterministic. For 50 nodes it might be okay, but Bunch isn’t widely available as a maintained tool nowadays. Its principles, however, influenced later methods.
• Hierarchical agglomerative clustering: Before the popularity of community detection, researchers attempted simple clustering using distances/similarities. Approaches like single-link, complete-link, or average-link clustering have been applied to software by defining a similarity between files (for instance, based on number of shared dependencies or conceptual similarity). Anquetil and Lethbridge (1999) tried agglomerative clustering on software modules. Maqbool and Babri (2007) proposed a weighted combined linkage method for software clustering. These can yield a dendrogram of components; you then cut the dendrogram at some level to get a certain number of clusters. The downside is you need to choose a cut (or number of clusters) manually, and pure structural metrics might not capture everything. We prefer algorithms that determine an optimal partition automatically (like Infomap or modularity optimization do).
• Girvan–Newman (edge betweenness) and other network algorithms: The Girvan–Newman algorithm detects communities by iteratively removing edges with highest betweenness (bridges between communities). It’s very effective on small networks and can accurately find boundaries if the graph has obvious bottleneck edges. It was referenced in software clustering literature as well. For a 30-node graph, Girvan–Newman would yield a hierarchy of possible splits; we could look for a jump in modularity or some metric to decide on 5 communities, for example. It’s computationally heavier (O(n*m) per step), but on small graphs that’s fine. However, it doesn’t guarantee an optimal cut in terms of any global quality function, and one must interpret the results (maybe using modularity or silhouette to pick the best division). It’s an option if we wanted a deterministic, brute-force-ish method to validate the community structure. But it’s seldom necessary given Infomap’s output.
• Label Propagation: This is a very fast, simple algorithm where nodes adopt the majority label of their neighbors until convergence. It’s appealing for huge graphs, but it’s nondeterministic and can yield different partitions each run. There’s also a “Chinese Whispers” variation (which is similar in concept). These were mentioned in some microservice papers (e.g., one survey noted Infomap and Chinese Whispers used for service identification). For our stable feature identification, label propagation is not ideal because we want repeatable, meaningful clusters. It might group things oddly, especially in dense graphs (it can arbitrarily break symmetry). We will likely avoid this for now.
• OSLOM, Infohiermap, etc.: There are more advanced community detection algorithms that handle overlapping communities or try to be resolution-limit-free (OSLOM, CPM, etc.). Overlapping communities (one file in multiple clusters) isn’t relevant here (each file belongs to exactly one feature in our assumption). So we can ignore those. Infohiermap is an extension of Infomap for hierarchical detection, but Infomap itself covers that nowadays.
• Domain-specific heuristics: Some tools incorporate semantic information (class names, documentation) or dynamic info (runtime traces) along with static dependencies to improve clustering. For instance, the study we cited combined structural communities with semantic cohesion measures to evaluate them. Another approach is clustering based on co-change patterns from version history (files that often change together likely belong in the same feature). These are complementary techniques that could refine results, but they go beyond pure graph clustering.
• Design Structure Matrix (DSM) partitioning: This is a method borrowed from systems engineering. A DSM is basically an adjacency matrix of components; partitioning the DSM to minimize off-diagonal density is akin to clustering. Algorithms like the one by Sangal et al. (using the Lattix tool) and others have done this. A recent paper by Savidis & Savaki (2022) used a DSM-based graph clustering algorithm tuned for software architecture (they report it performed well compared to some network algorithms). These methods often require tuning parameters and are less known outside research circles.
• Leiden (with resolution tuning): Although not “domain-specific,” one could treat Leiden with an adjusted resolution parameter as an alternative approach. With a resolution above 1, Leiden finds more, smaller communities (potentially splitting what we consider a feature into sub-parts). With a resolution below 1, it merges communities (potentially grouping features into bigger modules). If we wanted to enforce roughly 5 communities, we could sweep the resolution until Leiden outputs 5 clusters. This is a bit hacky but sometimes used when you desire a specific granularity. Infomap’s standard map equation has no such resolution knob, which is conceptually nice (it finds the “natural” granularity), but sometimes you might want to control the number of clusters. In such cases, other algorithms or manual slicing might be necessary.
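The resolution sweep mentioned in the Leiden bullet can be sketched as a simple bisection. The function `sweep_resolution` and its `count_communities` callback are hypothetical: in practice the callback would wrap a Leiden run at the given resolution (for example via the leidenalg library with a fixed seed) and return the number of communities. The sketch assumes that count does not decrease as the resolution grows, which holds only approximately for a stochastic algorithm.

```python
def sweep_resolution(count_communities, target, lo=0.1, hi=5.0, max_steps=30):
    """Bisect for a resolution whose partition has exactly `target` communities.

    count_communities: callable taking a resolution (float) and returning the
        number of communities found at that resolution (hypothetical wrapper
        around a Leiden run).
    Returns a resolution value, or None if none was found within max_steps.
    """
    for _ in range(max_steps):
        mid = (lo + hi) / 2
        found = count_communities(mid)
        if found == target:
            return mid
        if found < target:
            lo = mid  # too few communities: raise the resolution
        else:
            hi = mid  # too many communities: lower the resolution
    return None
```

Returning `None` rather than the closest value keeps the caller honest: some graphs have no resolution that yields the requested count, and forcing one would hide that.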
Overall, the state-of-the-art in academia often uses general community detection for software graphs, as opposed to algorithms uniquely invented for code. The reasoning is that the graph abstractions (classes or files with dependencies) can be effectively handled by well-understood clustering algorithms. Infomap and Leiden are both popular in recent studies. There isn’t a definitive “optimal code clustering algorithm” established – it’s an active area of research and often the best approach combines multiple criteria (structure, semantics, evolution). For our immediate needs, focusing on Infomap (or possibly Leiden as a backup) is sensible, given their strong performance and availability.
It’s worth noting that the combination of approaches can yield better results: e.g., use Infomap to get an initial grouping, then refine using semantic analysis or input from developers. Some papers have done that (clustering + semantic cohesion checks, or clustering + manual adjustments). In our context, we can use Infomap as the primary engine and then apply simple rules (like the edge-case handling above) to fine-tune the slices.
In conclusion, besides Infomap/Leiden:
• If we needed overlapping communities (which we don’t), algorithms like CPM or Link Communities exist.
• If we wanted guaranteed optimal modularity, exhaustive methods exist but are infeasible beyond ~30 nodes due to combinatorial explosion.
• If we wanted the absolute theoretically optimal for flow, one could brute-force minimize the map equation (but Infomap’s heuristic is excellent at that already).
• The old search-based methods (Bunch) are interesting but would essentially replicate what we get with modern methods, just slower.
Therefore, we will proceed with Infomap as our chosen algorithm, knowing that it stands on solid foundations (information theory, proven success in analogous domains) and is supported by both research and anecdotal evidence in the software modularity arena.
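For comparison with the search-based methods discussed earlier, the kind of objective Bunch optimizes can be computed directly and used to score any partition, including Infomap's. The sketch below follows one commonly cited formulation (TurboMQ), where each cluster contributes 2μ / (2μ + ε), with μ the internal edge weight and ε the weight of edges crossing the cluster boundary. The function name `turbo_mq` and the undirected edge format are assumptions of this example.

```python
from collections import defaultdict


def turbo_mq(membership, edges):
    """Score a partition with a TurboMQ-style cohesion/coupling measure.

    membership: {node: cluster_id}
    edges:      {(node_a, node_b): weight}, each undirected edge listed once.
    Higher is better; a cluster with no internal edges contributes 0.
    """
    intra = defaultdict(float)     # mu: weight of edges inside each cluster
    boundary = defaultdict(float)  # epsilon: weight of edges leaving each cluster
    for (a, b), weight in edges.items():
        ca, cb = membership[a], membership[b]
        if ca == cb:
            intra[ca] += weight
        else:
            boundary[ca] += weight
            boundary[cb] += weight

    return sum(
        2 * mu / (2 * mu + boundary[cluster])
        for cluster, mu in intra.items()
        if mu > 0
    )
```

Scoring the Infomap partition and a hand-made folder-based partition with the same function gives a quick, algorithm-independent sanity check on cohesion and coupling.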
Recommendations and Best Practices for Vertical Slice Detection
Drawing from the above analysis, here is a consolidated set of recommendations for detecting feature modules in a codebase:
• Use Infomap for community detection: Its information-flow oriented clustering is a natural fit for code dependencies. Infomap will likely group files into slices that correspond to functional features, as evidenced by both conceptual reasoning and empirical studies (e.g., capturing top-level modules in a game engine). Start with Infomap as the primary algorithm. Keep Leiden/Louvain as alternatives if needed for comparison or if Infomap results seem off.
• Run Infomap in two-level mode with multiple trials: Configure the algorithm to produce a flat partition (each community = one slice). For example, using the Infomap Python API: infomap = Infomap("--two-level --num-trials 10") (the --num-trials option, short form -N, corresponds to num_trials). This yields a single layer of communities and improves the chance of finding the best clustering. Set a fixed random seed (--seed) for reproducibility of results across runs and environments.
• Prepare the dependency graph carefully: Include all relevant files as nodes. Add edges for explicit imports (A imports B) and for directory coupling (files in the same folder). Treat the graph as undirected (for each import, add either an undirected edge or two directed edges). Double-check that isolated files (with no imports and no same-folder neighbors) appear as isolated nodes; they will end up as their own communities. If using a library, ensure it reads edge weights correctly (in our case all edges might just be weight 1). You might weight import edges slightly higher than directory edges if files linked by imports are not being grouped together, but initially keep it simple (equal weights). The key is that the graph truly reflects coupling: an import edge is a strong sign of coupling, and being in the same directory is a moderate sign. By combining them, we give the algorithm enough signal to cluster feature files.
• Post-process the communities with domain rules: Once Infomap outputs the communities, apply the edge-case handling logic:
• If a community has only one file and that file is something like main.go or a generic utility, decide how to present it. main.go alone is fine (it’s basically a slice of one). A utility alone might be better merged or labeled as “shared”.
• If you get one giant community containing what you expected to be multiple features, examine the dependency graph for why. If it’s due to actual heavy interconnection, you have two choices: accept it (the code might really function as one module) or manually split it (perhaps guided by known feature boundaries, essentially overriding the algorithm in this case). You could also re-run Leiden with a higher resolution to see if it splits that giant cluster, as a form of second opinion.
• Conversely, if you get too many tiny communities (say Infomap over-split something), consider merging ones that logically belong together. Infomap rarely yields completely spurious splits unless the data suggests it, but small clusters could sometimes be merged by a human who understands the code. The principle mentioned in the thesis is apt: it’s easier to merge small clusters than split big ones. Infomap tends toward smaller clusters when uncertain, which is a safer default. We can merge based on naming (e.g., if two clusters are both under features/auth/, they probably should be one).
• Validate communities with semantic or heuristic checks: As a sanity check, you can measure if the files in each Infomap cluster share common themes (e.g., similar name prefixes, or high internal code cohesion metrics). The research by Palomba et al. measured semantic cohesion and found Infomap clusters often made sense. We can do something simpler: ensure that each cluster mostly consists of files from one feature directory (if the project is structured by feature folders). In our zombie game example, we’d expect Infomap’s communities to align with the auth/, combat/, survivors/, resources/ folders. If they do, that’s a strong validation. If not, investigate why (maybe a file was pulled into the wrong cluster due to an import – e.g., if combat/routes.go got clustered with survivors because it imports something there, that flags a cross-module dependency).
• Speed and repeatability: Running Infomap on each codebase is very fast (<1 second), so it can even be integrated into a real-time analysis tool. Just ensure to use the same version of the algorithm/library for consistent results over time (Infomap updates are rare but possible). Because we fix the random seed, the output will not fluctuate between runs on the same input. If the code changes, of course the clusters can change, but that’s expected (and desired, since new or removed dependencies alter the graph).
• Edge case strategy: We will explicitly handle the “shared utilities” case by not treating them as separate features. One practical approach is to have a list of known shared components (like a utils package) and after clustering, attach those to a nearest community or mark them as “shared”. Another approach is to treat the presence of a single-node cluster that has edges to many other clusters as a sign of a utility; then handle it accordingly. Implementing this rule ensures the final reported slices focus on feature modules, not on generic libraries. Similarly, if a cluster is just an infrastructure piece (like maybe a database config file), we might not count it as a feature.
• Compare with Leiden (optional): For curiosity or verification, you could run Leiden algorithm (via the leidenalg library or similar) on the same graph. If Leiden yields a very different partition (e.g., splitting a community that Inf