October CodeRefinery HackMD, day 4
Icebreaker, day 4
The “helper” in the breakout rooms: what is a better name for their role?
- Tutor +3
- Facilitator
- Code Masters
- Code Refiners
- Helper :+1:
- Assistant
- https://www.thesaurus.com/browse/helper
- Guide
- The Auditors (of Reality)
- Exercise leader
- Minions
- Acolytes
- Expendables
- guardians of the refinery
- stack overflowers
- Senpai
- UnforGITables
- baby bugs
- deputy assistant educational communication junior executive managers
How are the breakout rooms going in general?:
- good
- so far so good, no complaints! :+1:
- Good, I guess the quality correlates directly with the experience of the helper.
- Helpful
- Works well!
Intro questions:
Reproducible research
Break-out session 1
Reconvene at 9:35
Discuss with your neighbors or among all participants
Computer programs are expected to produce the same output for the same inputs. Is that true for research software?
Can you give some examples? What can we do about it?
Word-count example
Let’s look at an example project which follows the project structure guidelines given above.
Since we’ll continue working with this repo, import it to your GitHub namespace by clicking “Use this template”. This generates a fresh repository from a template.
This project is about counting the frequency distribution of words in a given text, plotting results and testing Zipf’s law. We have subdirectories for raw data, source files, documentation, processed data and results, and README and LICENSE files.
What are the requirements.txt, Dockerfile, and Snakefile files for?
Do you think this project is reproducible?
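For orientation, a sketch of the layout such a project can have (names follow the description above; treat this as illustrative, not the exact repository contents):

```
word-count-project/
├── README.md
├── LICENSE
├── requirements.txt
├── Dockerfile
├── Snakefile
├── data/              # raw data
├── processed_data/    # processed data
├── source/            # source files
├── doc/               # documentation
└── results/           # results
```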
Group 16
Research Software Problems:
- Messing up input files, not finding old ones
- different software doing the same thing gives different results
- different machine -> different results
- how to treat special cases/special solutions
- different defaults
- different versions of the same software
What can we do about it?
- Put it all into one big data blob: code, inputs, publication
- Read the documentation
- Document everything
- Are you using version control?
- How do you handle collaborative issues?
- How would you like it to work if you could decide?
Group 11
- Are you using version control?
  - Yes, we use git to track the changes in our own code, but we have never used it for collaboration in a large project.
- How do you handle collaborative issues?
  - We use a log file and send code through email.
- How would you like it to work if you could decide?
Group 4
Group 15
- Package management can cause problems; it’s not trivial to replicate the same Python environment that was used in producing some research, especially as time goes on and new versions of packages are released.
Group 8
- requirements.txt: lists which Python packages (and which versions) are needed
- Dockerfile: creates a fully operational environment (including the Python distribution, nano, and the repo itself) on top of a reference operating system. Interestingly, it installs Python packages one by one and does not use the requirements.txt file present in the repo.
- Snakefile: defines a series of operations, each stating which Python files have to be executed and with which input files (the exact workings are mysterious to us). It also allows setting the number of threads/parallel processes.
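A minimal sketch of what a rule in such a Snakefile can look like (the rule, file, and script names here are invented to match the description above, not copied from the actual repository):

```
# hypothetical rule: produce processed data from one raw text file
rule count_words:
    input: "data/abyss.txt"
    output: "processed_data/abyss.dat"
    threads: 1
    shell: "python source/wordcount.py {input} {output}"
```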
Group 14
- requirements.txt: In general well written; versions of the libraries used are given. There is no need to state the versions of the software used to build the documentation and check the Python code style. However, the Python version used for testing this software is not known.
Compare requirements.txt:
A:
- No information at all. +2
- This doesn’t seem like a good strategy.
- Which version of scipy? etc. +16
- no information about the project
- .
B:
- Packages and versions are clearly given, but the master branch of the git repositories is given, which may change in the future
- Only packages are given; it implies that the repo works with the latest version, which may not be true. Dangerous.
- a list of software, or minimum requirements, but no further information
- One problem I anticipate with this approach: when the master branch evolves, it might break our code
- .
C:
- I don’t know what these git links are. Would be useful to have a word on it.
- I also don’t know exactly what the git commands mean. However, it seems the best alternative, since it specifies the exact location and commit from which to clone. +1
- It does seem possible to do automatic linking with this one. Documentation will probably be available on GitHub.
- The git links are the path to source code bundled as a Python package. Ideally, I would upload it first as a Python package on PyPI (the Python Package Index) and then simply write the package name in the requirements.txt, like how the other packages are listed along with their version numbers. This makes for clean dependency management, and the package itself can also be used by others, just like you would use numpy or pandas.
- I think this is the best way to state the requirements file. Version numbers are given and for each git repo a specific commit and/or tag is given to clone. +2
- software versions given with link to master branch
D:
- This seems to be the best description for me. Precise information given. :+2:
- What does “someproject” mean? And what about “anotherproject”? What does this version refer to?
- I’m not sure about the suitability of C vs D.
- “Someproject” and “anotherproject” are probably git repos and not libraries that can be pulled by pip/conda. The git clone links should be given.
- I thought someproject and anotherproject are just generic names for this example. If not, it is of course a problem.
- Here you have to manually search for the project / repository on github & the version. What if you don’t know the repository link?
E: What problems do you foresee when you write down minimal version constraints like scipy>=1.0?
- Future versions of the package might deprecate some functions or change how they work, or have different dependencies +2
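For reference, a fully pinned requirements.txt in the spirit of D could look like this (package names and version numbers are invented for illustration):

```
# every dependency pinned to an exact version
scipy==1.7.1
numpy==1.21.2
someproject==1.2.3
anotherproject==2.0.0
```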
Related questions:
- Can python or similar generate and read requirements.txt or is it meant to be created and read by humans?
- often written by humans and read by humans but also used by tools (tools like virtual environments, Conda, Binder, … understand this format)
- another popular and standard format to document Python dependencies is environment.yml (Conda)
- yes it can. Try `pip freeze > requirements.txt` and you’ll have all the packages in the requirements file. The reverse is `pip install -r requirements.txt`, which installs all the packages listed therein.
- Does using the git+ statement in requirements.txt just clone the repository, or also install the package?
- it downloads the package from github and installs it, so it expects that the repository/package contains specific files like setup.py or similar
- this can be useful to test package installation during development before sharing it on PyPI (Python package index)
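- A sketch of what such a line in requirements.txt can look like (user, repository, and tag are invented):
  ```
  # pip clones the repo at the given tag and installs it (needs setup.py or similar)
  git+https://github.com/someuser/someproject.git@v1.2.3
  ```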
- Is D now better than C or still not good enough?
- D is pretty good, versions are defined, packages not likely to disappear
- Thank you for the clarification
- more advanced answer for library developers: if you develop a library which is a dependency of other tools or libraries, you may want to not over-specify dependencies: you may want to prefer version ranges rather than specific versions, otherwise this can create problems for the libraries depending on your library.
- Sorry, missed what pip freeze does. Could you explain again in short?
- write out the current environment into a requirements.txt file: all installed libraries and their versions (“freeze the current environment into a file”)
- Usage: pip freeze > requirements.txt
- I think I still need a refresher on how to get into the git environment in the terminal and work all the way through, a few more times. I’m just still not certain of the order of commands and such.
- Good point. We should recap these whenever we pick up new tools. We will point it out when restarting after the break.
- what do you do if two packages require the same dependency, but of different versions?
- then you have a problem, assuming both are in the same project :-) (“dependency hell”). You can then try to convince one of the two packages to relax its version requirements. See also the “advanced answer” 3 questions up, exactly for this reason.
- if these are two different projects, then you can set up a virtual environment or Conda environment for each of these
- This is why I usually don’t advocate strict dependencies for reusable software. My attempt at getting people to design their software well: https://scicomp.aalto.fi/scicomp/packaging-software/
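- A minimal sketch of the per-project environment setup suggested above (standard venv workflow):
  ```
  python3 -m venv .venv               # create the environment inside the project folder
  source .venv/bin/activate           # activate it (Linux/macOS)
  pip install -r requirements.txt     # install this project's dependencies
  ```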
- when using `pip freeze > requirements.txt`, should we edit the file and keep only the modules we directly import into our projects? Otherwise, it seems to me that we could have a VERY long list of apparent dependencies that we don’t really use in our project.
- I think there is a way to freeze only the packages actively installed (in Conda you can do this with `--from-history`)
- What I do to avoid this: I don’t `pip freeze`, but rather I always first document a new installation in `requirements.txt`, then I install it from that file; then I end up only with the “first-degree” dependencies. (+1)
- If you use this to, say, publish a paper, your results may depend on the exact versions of nested dependencies.
- I agree, for a paper I would document all dependencies, the full `requirements.txt`
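- For Conda, the flag mentioned above is used like this (it exports only the packages you explicitly asked for, not the whole dependency tree):
  ```
  conda env export --from-history > environment.yml
  ```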
- let’s say package A depends on package B. If I keep only package A in requirements, is it safe to assume package A will require the correct version of package B?
- depends on how much you trust it!
- If freezing an environment for a paper, may as well freeze all
- If freezing an environment for you to use later, I usually freeze the minimum.
- In R, my packages update automatically; is that something I should turn off when I finish a project? I don’t know how to turn it off for only one project, since my packages are commonly shared between projects
- For R I recommend using https://rstudio.github.io/renv/articles/renv.html which is very similar to virtual environments in Python. This creates an `renv.lock` file which documents all dependencies. Each project can then have its own `renv.lock`.
- I have to use MATLAB. Is it enough to document only the MATLAB version?
- Documenting the version is good. Sorry, I don’t know MATLAB well enough, but how do you install libraries in MATLAB? Or is everything already packaged as one package with everything included?
- I would indeed mention the MATLAB version you are using and all the additional packages you use. For example, you might have installed the Signal Processing Toolbox. Annoyingly, if you haven’t installed the right package, MATLAB doesn’t always give this as an error, but just says a certain function is missing without referring to the missing package…
- Thanks. I don’t even remember if I’ve installed packages. Good real-world example! :D
- Pretty often all the toolboxes (or all that you have a licence for) get installed automatically, so one might not notice which function was “core” MATLAB and which was in a toolbox, until it breaks when you try to run it somewhere else ;)
- Very general question: the very first thing was to put everything related to one “project” in one folder. But how do you define a “project”? I have stuff that starts up as one “simulation project” that ends up being used in multiple experiments/papers/whatever
- Putting everything into one folder indeed assumes that we don’t reuse things across projects, but we also want to reuse things. So if a piece of code gets used in several projects, it can become a library, and perhaps should live in its own Git repository. There are ways to include Git repos in other Git repos. Another option might be to put the library part on PyPI or Conda or CRAN and to use your own code as a library in multiple projects.
- General feedback point: I don’t like this glossing over content. I would prefer either covering it in enough depth that I can actually use it later on (like we did with git last week), or leaving it out. Now I’m just getting slightly confused between all the different things covered. :+1:
- I second this, it leaves me confused and in a blurry state.
- Thanks for feedback. I understand.
- Sure. I also understand there is limited time and these are probably all really important things. But my capacity for learning is also limited, I guess.
- We will comment on that so that we do this better in the remaining sessions.
- I sort of like getting a brief overview of stuff that’s out there, so that at least I know what exists and can look further in my own time. But there is a balance, the question below is a good example of too much detail which is just confusing for me.
- Thanks for the feedback: Since we’re short on time I tried to give Snakemake to the break-out rooms, so you have time to explore and discuss.
- As a helper, this needs to be communicated upfront, because it means I am also assuming the role of instructor.
- what does “record the environment in the image part or the recipe part” mean?
- where was this mentioned?
- at the end of https://coderefinery.github.io/reproducible-research/04-environments/
- this refers to containers, where you can create images from recipes (e.g. a Dockerfile or Singularity file). You could either take an image and add your dependencies on top of it (this would be “documenting in the recipe part”), or you could install all your dependencies into an image and publish that image (“in the image part”). We went over it very quickly, so it’s not your fault if this is unclear after 2 minutes of explanation. Happy to talk about it more after the session today.
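- A minimal sketch of the “recipe part” (base image and file names are assumptions for illustration):
  ```
  FROM python:3.9-slim                    # start from a published base image
  COPY requirements.txt .
  RUN pip install -r requirements.txt     # dependencies documented in the recipe
  COPY . /project                         # add the project code itself
  ```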
- Is it good practice to create the virtual environment inside the project repo and then .gitignore it? Or is it better to have all virtual environments in the main python installation folder?
- You normally want to track changes to your `requirements.txt` in your git repository because it reflects the dependencies of that repo; the environment folder itself (e.g. `.venv`) is system-specific and should be ignored.
- As to where to place them: I like to create them in the project folder; then I see the venv in the same place as the project and don’t forget where they are.
- It’s the other way around: you first create a virtual environment and then start working on your project inside it. Everything from cloning the git repository onwards is done after you have created the virtual environment.
- Things like this are really useful info, thx!
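A minimal sketch of the pattern discussed above: create the environment inside the project folder and keep it out of Git (the folder name `.venv` is a common convention, not a requirement):

```
python3 -m venv .venv          # the environment lives inside the project folder
echo ".venv/" >> .gitignore    # but is never committed
git add .gitignore requirements.txt
```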
Exercise: snakemake
20 minutes, until xx:37
https://coderefinery.github.io/reproducible-research/05-workflow-management/#exercise-using-snakemake
Breakout room status:
-> We can have a walk-through of the snakemake exercise after today’s session (please stay in the main room after 12 CET / 13 EET). Hope we can clear things up there :)
Fair software
How to make your software reproducible
What are the alternatives to Zenodo?
Break
:::danger
Break until xx:03
:::
Social coding and open software
you can find the slides here: https://cicero.xyz/v3/remark/0.14.0/github.com/coderefinery/social-coding/master/talk.md/#1
Snakemake demonstration
:::danger
Break until xx:45
:::
Feedback Day 4
Please write here something positive about today and something that we can improve on
- Very interesting and useful topics. Everyone doing a research education (e.g. BSc/MSc, PhD, Postdoc) should learn about reproducible and FAIR science. Thanks for creating a very good intro to this wide topic, and thanks for gathering useful resources for further reading! +1
- The first hour went quite fast - it was confusing that many exercises/discussions were skipped. It was hard to know which exercise we were supposed to do when the breakout rooms started.
- [name=social coding]’s lecture was very interesting, still a bit fast (but he also realised it :-) Thank you), but luckily there is the recording and I will watch it again.
- Maybe it is worth extending the course over even more days?
- The first part of the day was hard. Select the material better and don’t say “we won’t cover it now”; it is somewhat stressful. Make a better selection (and point out where one could continue reading).
  - The whole confusion about exercise 4 could have been avoided by doing exercise 3 first, at least for me. [name=staff]’s explanation of the exercises on day 3 was a good example! Clear instructions; we knew what we had to do.
  - The after-session with [name=staff] helped make sure I took something home with me from the first part of the day. Thank you very much.
- [name=staff]’s mic volume is lower than the other lecturers’. I want to learn from you, but I found it difficult to hear. [name=staff]’s mic is the crispest of all.
Always ask questions at the very bottom of this document, right above this. Switch to view mode if you are only watching.
We are monitoring this HackMD, but we will reply only every now and then so that you can focus on the speaker.