Core Sources

The purpose of a Source is to provide CodeSurvey with repositories of code (referred to as Repos) to analyze. A Source may retrieve code from specific local directories or remote repositories, or may use APIs to sample repositories from a large pool, such as a code-hosting platform like GitHub.

Built-In Sources

CodeSurvey provides the following built-in Sources for you to survey code from common types of code repositories:

codesurvey.sources.LocalSource

Bases: Source

Source of Repos from local filesystem directories.

Example usage:

LocalSource([
    'path/to/my-source-code-directory',
])
Source code in codesurvey/sources/core.py
class LocalSource(Source):
    """
    Source of Repos from local filesystem directories.

    Example usage:

    ```python
    LocalSource([
        'path/to/my-source-code-directory',
    ])
    ```

    """
    default_name = 'local'

    def __init__(self, dirs: Sequence[str], *, name: Optional[str] = None):
        """
        Args:
            dirs: Paths to the local source code directory of each Repo
            name: Name to identify the Source. If `None`, defaults to 'local'.
        """
        self.dirs = dirs
        super().__init__(name=name)

    def fetch_repo(self, repo_key: str) -> Repo:
        return self.repo(
            key=repo_key,
            path=repo_key,
        )

    def repo_generator(self) -> Iterator[Repo]:
        for repo_dir in self.dirs:
            yield self.fetch_repo(repo_dir)

__init__(dirs: Sequence[str], *, name: Optional[str] = None)

Parameters:

  • dirs (Sequence[str]) –

    Paths to the local source code directory of each Repo

  • name (Optional[str], default: None ) –

    Name to identify the Source. If None, defaults to 'local'.

Source code in codesurvey/sources/core.py
def __init__(self, dirs: Sequence[str], *, name: Optional[str] = None):
    """
    Args:
        dirs: Paths to the local source code directory of each Repo
        name: Name to identify the Source. If `None`, defaults to 'local'.
    """
    self.dirs = dirs
    super().__init__(name=name)

codesurvey.sources.GitSource

Bases: Source

Source of Repos from remote Git repositories.

Repos are downloaded into a local directory for analysis.

Example usage:

GitSource([
    'https://github.com/whenofpython/codesurvey',
])
Source code in codesurvey/sources/core.py
class GitSource(Source):
    """
    Source of Repos from remote Git repositories.

    Repos are downloaded into a local directory for analysis.

    Example usage:

    ```python
    GitSource([
        'https://github.com/whenofpython/codesurvey',
    ])
    ```

    """
    default_name = 'git'

    def __init__(self, repo_urls: Sequence[str], *, name: Optional[str] = None):
        """
        Args:
            repo_urls: URLs of remote Git repositories.
            name: Name to identify the Source. If `None`, defaults to 'git'.
        """
        self.repo_urls = repo_urls
        super().__init__(name=name)

    def fetch_repo(self, repo_key: str) -> Repo:
        try:
            temp_dir = fetch_git_repo(repo_key)
        except SourceError as ex:
            raise SourceError(f'Source {self} failed to clone Git repository: {ex}')
        return self.repo(
            key=repo_key,
            path=temp_dir,
            # When the repo is finished being used, the temp_dir
            # should be deleted:
            cleanup=partial(TemporaryDirectory._rmtree, temp_dir),  # type: ignore[attr-defined]
        )

    def repo_generator(self) -> Iterator[RepoThunk]:
        for repo_url in self.repo_urls:
            yield self.repo_thunk(
                key=repo_url,
                thunk=partial(self.fetch_repo, repo_url),
            )

__init__(repo_urls: Sequence[str], *, name: Optional[str] = None)

Parameters:

  • repo_urls (Sequence[str]) –

    URLs of remote Git repositories.

  • name (Optional[str], default: None ) –

    Name to identify the Source. If None, defaults to 'git'.

Source code in codesurvey/sources/core.py
def __init__(self, repo_urls: Sequence[str], *, name: Optional[str] = None):
    """
    Args:
        repo_urls: URLs of remote Git repositories.
        name: Name to identify the Source. If `None`, defaults to 'git'.
    """
    self.repo_urls = repo_urls
    super().__init__(name=name)
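
The body of `fetch_git_repo` is not shown on this page, so the following is only a rough sketch of how a remote repository might be cloned into a temporary directory. The helper names `build_clone_command` and `clone_to_temp_dir`, and the choice of a shallow (`--depth 1`) clone, are assumptions made for illustration and are not taken from CodeSurvey.

```python
import subprocess
from tempfile import mkdtemp
from typing import List


def build_clone_command(repo_url: str, dest_dir: str) -> List[str]:
    """Build a git command line for a shallow clone (hypothetical helper)."""
    return ['git', 'clone', '--depth', '1', repo_url, dest_dir]


def clone_to_temp_dir(repo_url: str) -> str:
    """Clone repo_url into a new temporary directory and return its path.

    Assumption: requires a `git` executable on the PATH. Raises
    subprocess.CalledProcessError if the clone fails.
    """
    temp_dir = mkdtemp()
    subprocess.run(
        build_clone_command(repo_url, temp_dir),
        check=True,
        capture_output=True,
    )
    return temp_dir


# Building the command does not touch the network:
print(build_clone_command('https://github.com/whenofpython/codesurvey', '/tmp/example'))
```

The caller remains responsible for deleting the returned directory, which is the role the `cleanup` callable plays in the `fetch_repo` listing above.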

codesurvey.sources.GithubSampleSource

Bases: Source

Source of Repos sampled from GitHub's search API.

Repos are sampled from randomly selected pages of GitHub search results, and downloaded to a temporary directory for analysis.

For explanations of GitHub search parameters, see: https://docs.github.com/en/free-pro-team@latest/rest/search/search#search-repositories

GitHub authentication credentials can be provided to increase rate limits. See: https://docs.github.com/en/rest/overview/authenticating-to-the-rest-api

Example usage:

GithubSampleSource(language='python')
Source code in codesurvey/sources/core.py
class GithubSampleSource(Source):
    """Source of Repos sampled from GitHub's search API.

    Repos are sampled from randomly selected pages of GitHub search
    results, and downloaded to a temporary directory for analysis.

    For explanations of GitHub search parameters, see:
    https://docs.github.com/en/free-pro-team@latest/rest/search/search#search-repositories

    GitHub authentication credentials can be provided to increase rate
    limits. See:
    https://docs.github.com/en/rest/overview/authenticating-to-the-rest-api

    Example usage:

    ```python
    GithubSampleSource(language='python')
    ```

    """
    default_name = 'github_sample'

    REPOS_PER_PAGE = 100
    # GitHub only returns the first 1,000 search results
    MAX_RESULTS = 1000

    def __init__(self, *,
                 search_query: str = '',
                 language: Optional[str],
                 max_kb: Optional[int] = 50_000,
                 sort: str = 'updated',
                 auth_username: Optional[str] = None,
                 auth_token: Optional[str] = None,
                 random_seed: Optional[int] = None,
                 name: Optional[str] = None):
        """
        Args:
            search_query: An optional search query for GitHub search.
            language: An optional constraint for GitHub's repository
                language tag.
            max_kb: To avoid downloading excessively large repositories,
                limits the maximum kilobyte size of sampled Repos.
            sort: Sort order for GitHub search. Important as GitHub will
                only return the first 1000 search results to sample
                from. Defaults to searching for recently updated repositories.
            auth_username: Username for GitHub authentication.
            auth_token: Token for GitHub authentication.
            random_seed: Random seed for sampling pages of search results.
                If `None`, a randomly selected seed is used.
            name: Name to identify the Source. If `None`, defaults
                to 'github_sample'.
        """
        self.search_query = search_query
        self.language = language
        self.max_kb = max_kb
        self.sort = sort
        self.auth = (auth_username, auth_token) if auth_username and auth_token else None
        self.random_seed = random_seed
        super().__init__(name=name)

    def _search_repos(self, *, page: int = 1) -> dict:
        """
        Makes a GitHub repo search API call for the specified page index.

        Returns a dictionary containing result metadata (total number of pages)
        and a list of dictionaries containing metadata for found repos.

        See:

        * https://docs.github.com/en/rest/search#search-repositories
        * https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories
        """
        q_parts = []
        if self.search_query:
            q_parts.append(self.search_query)
        if self.language is not None:
            q_parts.append(f'language:{self.language}')
        if self.max_kb is not None:
            q_parts.append(f'size:<={self.max_kb}')

        params: Dict[str, Union[str, int]] = {
            'q': ' '.join(q_parts),
            'sort': self.sort,
            'per_page': self.REPOS_PER_PAGE,
            'page': page,
        }

        r = requests.get(
            'https://api.github.com/search/repositories',
            auth=self.auth,
            params=params,
        )
        r_json = r.json()
        return {
            # Return the total number of result pages that can be
            # sampled from.
            'page_count': min(self.MAX_RESULTS, r_json['total_count']) / self.REPOS_PER_PAGE,
            # Restrict the list of returned repos to those that
            # have the (optionally) specified language.
            'repos': [
                item for item in r_json['items']
                if (self.language is None or (str(item['language']).lower() == self.language.lower()))
            ],
        }

    def _clone_repo(self, repo_data: dict) -> Repo:
        """Helper to clone a Git repository given repo_data from the GitHub repos API."""
        try:
            temp_dir = fetch_git_repo(repo_data['clone_url'])
        except SourceError as ex:
            raise SourceError(f'Source {self} failed to clone from GitHub: {ex}')
        return self.repo(
            key=repo_data['full_name'],
            path=temp_dir,
            # When the repo is finished being used, the temp_dir
            # should be deleted:
            cleanup=partial(TemporaryDirectory._rmtree, temp_dir),  # type: ignore[attr-defined]
            metadata={
                'stars': repo_data['stargazers_count'],
            },
        )

    def fetch_repo(self, repo_key: str) -> Repo:
        r = requests.get(
            f'https://api.github.com/repos/{repo_key}',
            auth=self.auth,
        )
        return self._clone_repo(r.json())

    def repo_generator(self) -> Iterator[RepoThunk]:
        rng = Random(self.random_seed)
        page_count = 1
        while True:
            logger.info(f'Source "{self}" searching GitHub for repos')
            search_result = self._search_repos(page=rng.randint(1, page_count))
            page_count = int(search_result['page_count'])
            for repo_data in search_result['repos']:
                yield self.repo_thunk(
                    key=repo_data['full_name'],
                    thunk=partial(self._clone_repo, repo_data),
                )

__init__(*, search_query: str = '', language: Optional[str], max_kb: Optional[int] = 50000, sort: str = 'updated', auth_username: Optional[str] = None, auth_token: Optional[str] = None, random_seed: Optional[int] = None, name: Optional[str] = None)

Parameters:

  • search_query (str, default: '' ) –

    An optional search query for GitHub search.

  • language (Optional[str]) –

    An optional constraint for GitHub's repository language tag.

  • max_kb (Optional[int], default: 50000 ) –

    To avoid downloading excessively large repositories, limits the maximum kilobyte size of sampled Repos.

  • sort (str, default: 'updated' ) –

    Sort order for GitHub search. Important as GitHub will only return the first 1000 search results to sample from. Defaults to searching for recently updated repositories.

  • auth_username (Optional[str], default: None ) –

    Username for GitHub authentication.

  • auth_token (Optional[str], default: None ) –

    Token for GitHub authentication.

  • random_seed (Optional[int], default: None ) –

    Random seed for sampling pages of search results. If None, a randomly selected seed is used.

  • name (Optional[str], default: None ) –

    Name to identify the Source. If None, defaults to 'github_sample'.
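
To make the effect of `search_query`, `language` and `max_kb` concrete, here is a small standalone sketch of how they could be combined into a single GitHub search string. It mirrors the logic in the `_search_repos` listing above. The function name `build_search_query` is hypothetical and is not part of CodeSurvey.

```python
from typing import Optional


def build_search_query(search_query: str = '',
                       language: Optional[str] = None,
                       max_kb: Optional[int] = 50_000) -> str:
    """Compose the `q` search parameter from the optional constraints."""
    q_parts = []
    if search_query:
        q_parts.append(search_query)
    if language is not None:
        # GitHub's repository language qualifier
        q_parts.append(f'language:{language}')
    if max_kb is not None:
        # GitHub's repository size qualifier, in kilobytes
        q_parts.append(f'size:<={max_kb}')
    return ' '.join(q_parts)


print(build_search_query(language='python'))
# language:python size:<=50000
print(build_search_query('machine learning', language=None, max_kb=None))
# machine learning
```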

Source code in codesurvey/sources/core.py
def __init__(self, *,
             search_query: str = '',
             language: Optional[str],
             max_kb: Optional[int] = 50_000,
             sort: str = 'updated',
             auth_username: Optional[str] = None,
             auth_token: Optional[str] = None,
             random_seed: Optional[int] = None,
             name: Optional[str] = None):
    """
    Args:
        search_query: An optional search query for GitHub search.
        language: An optional constraint for GitHub's repository
            language tag.
        max_kb: To avoid downloading excessively large repositories,
            limits the maximum kilobyte size of sampled Repos.
        sort: Sort order for GitHub search. Important as GitHub will
            only return the first 1000 search results to sample
            from. Defaults to searching for recently updated repositories.
        auth_username: Username for GitHub authentication.
        auth_token: Token for GitHub authentication.
        random_seed: Random seed for sampling pages of search results.
            If `None`, a randomly selected seed is used.
        name: Name to identify the Source. If `None`, defaults
            to 'github_sample'.
    """
    self.search_query = search_query
    self.language = language
    self.max_kb = max_kb
    self.sort = sort
    self.auth = (auth_username, auth_token) if auth_username and auth_token else None
    self.random_seed = random_seed
    super().__init__(name=name)
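
The random page sampling described for GithubSampleSource can be sketched in isolation as follows. The constants match those in the class listing. The function names are hypothetical, and this sketch deliberately differs from the listing in one respect: it rounds the page count up and never lets it drop below 1, whereas the listing truncates the division.

```python
import math
from random import Random
from typing import List, Optional

REPOS_PER_PAGE = 100
# GitHub only returns the first 1,000 search results
MAX_RESULTS = 1000


def sampleable_page_count(total_count: int) -> int:
    """Number of result pages that can be requested for a search."""
    reachable = min(MAX_RESULTS, total_count)
    # Assumption of this sketch: round up so a partial last page is
    # reachable, and keep at least one page so randint(1, n) is valid.
    return max(1, math.ceil(reachable / REPOS_PER_PAGE))


def sample_pages(total_count: int, n: int,
                 random_seed: Optional[int] = None) -> List[int]:
    """Pick n page indexes (1-based) at random, reproducibly for a given seed."""
    rng = Random(random_seed)
    page_count = sampleable_page_count(total_count)
    return [rng.randint(1, page_count) for _ in range(n)]


print(sampleable_page_count(250_000))
# 10
print(sample_pages(250_000, n=5, random_seed=42))
```

With a fixed `random_seed` the same sequence of pages is requested on every run, although the repositories found on those pages can still change as GitHub's search results change.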

codesurvey.sources.TestSource

Bases: Source

Creates a single Repo in a temporary directory with specified files and contents.

Only use with trusted paths, as paths are not checked for absolute or parent directory navigation.

Source code in codesurvey/sources/core.py
class TestSource(Source):
    """Creates a single Repo in a temporary directory with specified files
    and contents.

    Only use with trusted paths, as paths are not checked for absolute
    or parent directory navigation.

    """
    default_name = 'test'

    def __init__(self, path_to_content: Mapping[str, str], *, name: Optional[str] = None):
        """
        Args:
            path_to_content: Mapping of paths to contents for files to create
                in a test Repo directory
            name: Name to identify the Source. If `None`, defaults to 'test'.
        """
        self.path_to_content = path_to_content
        super().__init__(name=name)

    def fetch_repo(self, repo_key: str) -> Repo:
        return self.repo(
            key=repo_key,
            path=repo_key,
            # When the repo is finished being used, the temporary
            # directory should be deleted:
            cleanup=partial(TemporaryDirectory._rmtree, repo_key),  # type: ignore[attr-defined]
        )

    def repo_generator(self) -> Iterator[Repo]:
        temp_dir = mkdtemp()
        for path, content in self.path_to_content.items():
            path_head, path_tail = os.path.split(path)
            path_dir = os.path.join(temp_dir, path_head)
            os.makedirs(path_dir, exist_ok=True)
            with open(os.path.join(path_dir, path_tail), 'w') as path_file:
                path_file.write(content)
        yield self.fetch_repo(temp_dir)

__init__(path_to_content: Mapping[str, str], *, name: Optional[str] = None)

Parameters:

  • path_to_content (Mapping[str, str]) –

    Mapping of paths to contents for files to create in a test Repo directory

  • name (Optional[str], default: None ) –

    Name to identify the Source. If None, defaults to 'test'.

Source code in codesurvey/sources/core.py
def __init__(self, path_to_content: Mapping[str, str], *, name: Optional[str] = None):
    """
    Args:
        path_to_content: Mapping of paths to contents for files to create
            in a test Repo directory
        name: Name to identify the Source. If `None`, defaults to 'test'.
    """
    self.path_to_content = path_to_content
    super().__init__(name=name)
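
The warning about trusted paths is easier to see in a standalone sketch. The function below writes a mapping of paths to contents into a temporary directory, much as the `repo_generator` listing above does, but adds a path check that TestSource itself does not perform. The name `write_repo_files` and the check are assumptions for illustration only.

```python
import os
import shutil
from tempfile import mkdtemp
from typing import Mapping


def write_repo_files(path_to_content: Mapping[str, str]) -> str:
    """Create a temporary directory containing the given files."""
    # Hypothetical safety check (not present in TestSource): reject
    # absolute paths and paths that navigate above the repo directory.
    normalized_paths = {}
    for path in path_to_content:
        normalized = os.path.normpath(path)
        if os.path.isabs(normalized) or normalized.split(os.sep)[0] == os.pardir:
            raise ValueError(f'Unsafe path: {path}')
        normalized_paths[path] = normalized

    temp_dir = mkdtemp()
    for path, content in path_to_content.items():
        full_path = os.path.join(temp_dir, normalized_paths[path])
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        with open(full_path, 'w') as path_file:
            path_file.write(content)
    return temp_dir


repo_dir = write_repo_files({
    'main.py': 'print("hello")\n',
    'pkg/util.py': 'VALUE = 1\n',
})
print(sorted(os.listdir(repo_dir)))
# ['main.py', 'pkg']
shutil.rmtree(repo_dir)
```

Without such a check, a key like `'../outside.py'` would be written outside the temporary directory, which is why the mapping should only come from trusted input.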

Custom Sources

You can define your own Source to provide Repos from other storage providers, platforms or APIs. Define a class that inherits from Source and implements two methods: fetch_repo(), which returns the Repo for a given key, and repo_generator(), which yields Repos, or RepoThunks that can be executed in a parallelizable sub-process to prepare a Repo.

Your Source should also specify a default_name class attribute that will be used to identify your Source in logs and results (except where a name is provided for a specific Source instance).

For example, to define a custom Source that receives a custom_arg and directly returns Repos:

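from typing import Iterator, Optional

from codesurvey.sources import Repo, Source


class CustomSource(Source):
    default_name = 'custom'

    def __init__(self, custom_arg, *, name: Optional[str] = None):
        self.custom_arg = custom_arg
        super().__init__(name=name)

    def fetch_repo(self, repo_key: str) -> Repo:
        # TODO: resolve repo_key to a local directory containing the code
        repo_path = ...
        return self.repo(
            key=repo_key,
            path=repo_path,
        )

    def repo_generator(self) -> Iterator[Repo]:
        while True:
            # TODO: determine the key of the next Repo to provide
            repo_key = ...
            yield self.fetch_repo(repo_key)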

Alternatively, your custom Source can delay downloading or otherwise preparing a Repo to a parallelizable sub-process by yielding RepoThunks from repo_generator():

def repo_generator(self) -> Iterator[RepoThunk]:
    while True:
        repo_key = ...  # TODO: determine the key of the next Repo to provide
        yield self.repo_thunk(
            key=repo_key,
            thunk=functools.partial(self.fetch_repo, repo_key),
        )
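
The point of yielding thunks is that no fetching happens until a thunk is called. The sketch below shows this with simplified stand-in classes, so it runs without CodeSurvey installed. `StandInRepo` and `StandInThunk` are hypothetical, carry only a subset of the real fields, and omit the `source` field.

```python
from dataclasses import dataclass
from functools import partial
from typing import Callable, Iterator, List, Sequence


@dataclass(frozen=True)
class StandInRepo:
    """Simplified stand-in for Repo."""
    key: str
    path: str


@dataclass(frozen=True)
class StandInThunk:
    """Simplified stand-in for RepoThunk."""
    key: str
    thunk: Callable[[], StandInRepo]


fetched: List[str] = []  # records which keys have actually been fetched


def fetch_repo(repo_key: str) -> StandInRepo:
    fetched.append(repo_key)  # stands in for an expensive download
    return StandInRepo(key=repo_key, path=f'/tmp/{repo_key}')


def repo_generator(keys: Sequence[str]) -> Iterator[StandInThunk]:
    for key in keys:
        # partial() binds the key now, but defers calling fetch_repo
        yield StandInThunk(key=key, thunk=partial(fetch_repo, key))


thunks = list(repo_generator(['a', 'b']))
print(fetched)
# []
first_repo = thunks[0].thunk()
print(fetched)
# ['a']
```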

Core Classes

codesurvey.sources.Source

Bases: ABC

Provides Repos to be analyzed by CodeSurvey.

Source code in codesurvey/sources/core.py
class Source(ABC):
    """Provides Repos to be analyzed by CodeSurvey."""

    default_name: str
    """Default name to be assigned to Sources of this type if a custom
    name is not specified."""

    def __init__(self, *, name: Optional[str] = None):
        """
        Args:
            name: Name to identify the Source. If `None`, defaults to the
                Source type's default_name
        """
        self.name = self.default_name if name is None else name
        if self.name is None:
            raise ValueError('Source name cannot be None')

    @abstractmethod
    def fetch_repo(self, repo_key: str) -> Repo:
        """Prepares the [Repo][codesurvey.sources.Repo] with the given
        `repo_key` for analysis.

        Typically called internally by repo_generator or by a
        RepoThunk, but also useful for inspecting a Repo given its
        key from a survey result.

        """

    @abstractmethod
    def repo_generator(self) -> Iterator[Union[Repo, RepoThunk]]:
        """Generator yielding [Repos][codesurvey.sources.Repo] ready for
        analysis or [RepoThunks][codesurvey.sources.RepoThunk] that
        can be executed to prepare them for analysis."""

    def __str__(self):
        return self.name

    def __repr__(self):
        return f'{self.__class__.__name__}({self})'

    def repo(self, **kwargs):
        """Internal helper to generate a Repo for this Source. Takes the same
        arguments as Repo except for source."""
        return Repo(source=self, **kwargs)

    def repo_thunk(self, **kwargs):
        """Internal helper to generate a RepoThunk for this Source. Takes the
        same arguments as RepoThunk except for source."""
        return RepoThunk(source=self, **kwargs)

default_name: str instance-attribute

Default name to be assigned to Sources of this type if a custom name is not specified.

__init__(*, name: Optional[str] = None)

Parameters:

  • name (Optional[str], default: None ) –

    Name to identify the Source. If None, defaults to the Source type's default_name

Source code in codesurvey/sources/core.py
def __init__(self, *, name: Optional[str] = None):
    """
    Args:
        name: Name to identify the Source. If `None`, defaults to the
            Source type's default_name
    """
    self.name = self.default_name if name is None else name
    if self.name is None:
        raise ValueError('Source name cannot be None')
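
As a quick illustration of how `default_name` and `name` interact, the following sketch reproduces only the naming logic from the listing above in a stand-in base class, so that it runs without CodeSurvey installed. `StandInSource` is hypothetical and omits the abstract methods of the real Source.

```python
from typing import Optional


class StandInSource:
    """Stand-in reproducing only the naming behaviour of Source."""
    default_name: str

    def __init__(self, *, name: Optional[str] = None):
        # Fall back to the class-level default when no name is given
        self.name = self.default_name if name is None else name

    def __str__(self):
        return self.name

    def __repr__(self):
        return f'{self.__class__.__name__}({self})'


class CustomSource(StandInSource):
    default_name = 'custom'


default_source = CustomSource()
named_source = CustomSource(name='custom-mirror')
print(str(default_source))
# custom
print(repr(named_source))
# CustomSource(custom-mirror)
```

Giving each instance its own name is mainly useful when a survey uses several Sources of the same type, since the name is what identifies a Source in logs and results.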

fetch_repo(repo_key: str) -> Repo abstractmethod

Prepares the Repo with the given repo_key for analysis.

Typically called internally by repo_generator or by a RepoThunk, but also useful for inspecting a Repo given its key from a survey result.

Source code in codesurvey/sources/core.py
@abstractmethod
def fetch_repo(self, repo_key: str) -> Repo:
    """Prepares the [Repo][codesurvey.sources.Repo] with the given
    `repo_key` for analysis.

    Typically called internally by repo_generator or by a
    RepoThunk, but also useful for inspecting a Repo given its
    key from a survey result.

    """

repo_generator() -> Iterator[Union[Repo, RepoThunk]] abstractmethod

Generator yielding Repos ready for analysis or RepoThunks that can be executed to prepare them for analysis.

Source code in codesurvey/sources/core.py
@abstractmethod
def repo_generator(self) -> Iterator[Union[Repo, RepoThunk]]:
    """Generator yielding [Repos][codesurvey.sources.Repo] ready for
    analysis or [RepoThunks][codesurvey.sources.RepoThunk] that
    can be executed to prepare them for analysis."""

repo(**kwargs)

Internal helper to generate a Repo for this Source. Takes the same arguments as Repo except for source.

Source code in codesurvey/sources/core.py
def repo(self, **kwargs):
    """Internal helper to generate a Repo for this Source. Takes the same
    arguments as Repo except for source."""
    return Repo(source=self, **kwargs)

repo_thunk(**kwargs)

Internal helper to generate a RepoThunk for this Source. Takes the same arguments as RepoThunk except for source.

Source code in codesurvey/sources/core.py
def repo_thunk(self, **kwargs):
    """Internal helper to generate a RepoThunk for this Source. Takes the
    same arguments as RepoThunk except for source."""
    return RepoThunk(source=self, **kwargs)

codesurvey.sources.Repo dataclass

A repository of code that is accessible in a local directory in order to be analyzed.

Source code in codesurvey/sources/core.py
@dataclass(frozen=True)
class Repo:
    """A repository of code that is accessible in a local directory in
    order to be analyzed."""

    source: 'Source'
    """Source the Repo is provided by."""

    key: str
    """Unique key of the Repo within its Source."""

    # TODO: Change to a pathlib.Path?
    path: str
    """Path to the local directory storing the Repo."""

    cleanup: Callable[[], None] = noop
    """Function to be called to remove or otherwise clean up the Repo when
    analysis of it has finished."""

    metadata: Dict[str, Any] = field(default_factory=dict, compare=False)
    """Additional properties describing the Repo.

    The metadata structure may vary depending on the type of
    Source.

    """

    def __str__(self):
        return f'{self.source.name}:{self.key}'

    def __repr__(self):
        return f'{self.__class__.__name__}({self})'

source: Source instance-attribute

Source the Repo is provided by.

key: str instance-attribute

Unique key of the Repo within its Source.

path: str instance-attribute

Path to the local directory storing the Repo.

cleanup: Callable[[], None] = noop class-attribute instance-attribute

Function to be called to remove or otherwise clean up the Repo when analysis of it has finished.

metadata: Dict[str, Any] = field(default_factory=dict, compare=False) class-attribute instance-attribute

Additional properties describing the Repo.

The metadata structure may vary depending on the type of Source.
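
The roles of `cleanup` and `metadata` can be sketched with a stand-in dataclass. `StandInRepo` below is hypothetical and omits the `source` field. `count_files` is an invented consumer, and `shutil.rmtree` is used for cleanup here as a public alternative to the private helper that appears in the Source listings.

```python
import os
import shutil
from dataclasses import dataclass, field
from functools import partial
from tempfile import mkdtemp
from typing import Any, Callable, Dict


def noop() -> None:
    """Default cleanup: nothing to remove."""


@dataclass(frozen=True)
class StandInRepo:
    """Stand-in for Repo, without the `source` field."""
    key: str
    path: str
    cleanup: Callable[[], None] = noop
    metadata: Dict[str, Any] = field(default_factory=dict, compare=False)


def count_files(repo: StandInRepo) -> int:
    """Example consumer: analyze the Repo, then always clean it up."""
    try:
        return sum(len(files) for _, _, files in os.walk(repo.path))
    finally:
        repo.cleanup()


temp_dir = mkdtemp()
with open(os.path.join(temp_dir, 'main.py'), 'w') as f:
    f.write('print("hello")\n')

repo = StandInRepo(
    key='demo',
    path=temp_dir,
    cleanup=partial(shutil.rmtree, temp_dir),
    metadata={'stars': 3},
)
file_count = count_files(repo)
print(file_count, os.path.exists(temp_dir))
# 1 False
```

Because `metadata` is declared with `compare=False`, two Repos that differ only in their metadata still compare as equal.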

codesurvey.sources.RepoThunk dataclass

An executable task to be run asynchronously to prepare a Repo.

Source code in codesurvey/sources/core.py
@dataclass(frozen=True)
class RepoThunk:
    """An executable task to be run asynchronously to prepare a Repo."""

    source: 'Source'
    """Source the Repo is provided by."""

    key: str
    """Unique key of the Repo within its Source."""

    thunk: Callable[[], Repo]
    """Function to be called to prepare and return the Repo."""

    def __str__(self):
        return f'{self.source.name}:{self.key}'

    def __repr__(self):
        return f'{self.__class__.__name__}({self})'

source: Source instance-attribute

Source the Repo is provided by.

key: str instance-attribute

Unique key of the Repo within its Source.

thunk: Callable[[], Repo] instance-attribute

Function to be called to prepare and return the Repo.
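
How CodeSurvey itself schedules RepoThunks is not shown on this page. As a rough illustration of why a zero-argument callable is convenient for asynchronous preparation, the sketch below runs a few thunks on a thread pool from the standard library. `prepare_repo` is a hypothetical stand-in for an expensive fetch, and threads are used here only for brevity.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import Callable, List


def prepare_repo(repo_key: str) -> str:
    """Hypothetical stand-in for downloading a Repo; returns its path."""
    time.sleep(0.01)  # simulate slow I/O
    return f'/tmp/{repo_key}'


# Each thunk is a zero-argument callable, like RepoThunk.thunk
thunks: List[Callable[[], str]] = [
    partial(prepare_repo, key) for key in ['a', 'b', 'c']
]

with ThreadPoolExecutor(max_workers=2) as pool:
    # map() preserves the order of the input thunks in its results
    paths = list(pool.map(lambda thunk: thunk(), thunks))

print(paths)
# ['/tmp/a', '/tmp/b', '/tmp/c']
```

If thunks are sent to separate processes instead, they generally need to be picklable, which is one reason to build them with `functools.partial` rather than a lambda.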