<h1>APK, images and other stuff</h1>
<p>Maria Glukhova (Student; Outreachy intern), 2017-01-16, https://siamezzze.github.io/APK,-images-and-other-stuff</p>
<p>Two more weeks of my awesome Outreachy journey have passed, so it is time for an update on my progress.</p>
<p>I continued my work on improving diffoscope by fixing bugs and completing wishlist items. These include:</p>
<h2 id="improving-apk-support">Improving APK support</h2>
<p>I worked on <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=850501">#850501</a> and
<a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=850502">#850502</a> to improve the way diffoscope handles APK files.
Thanks to Emanuel Bronshtein for providing a clear description of how to reproduce these
bugs and ideas on how to fix them.</p>
<p>And special thanks to Chris Lamb for insisting on providing tests for these changes!
That part actually proved to be a little more tricky, and I managed to mess up these tests (extra thanks to Chris for
cleaning up the mess I created). I hope that also means I learned something from my mistakes.</p>
<p>Also, I was pleased to see the <a href="https://verification.f-droid.org/">F-Droid Verification Server</a> as a sign of F-Droid’s progress
on the reproducible builds effort - I hope these changes to diffoscope will help them!</p>
<h2 id="adding-support-for-image-metadata">Adding support for image metadata</h2>
<p>That came from <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=849395">#849395</a> - a request to compare image
metadata along with image content. Diffoscope supports three types of images: JPEG, MS Windows Icon (*.ico) and PNG.
Among these, PNG already had good image metadata support thanks to the <code class="highlighter-rouge">sng</code> tool, so I worked on support for .jpeg and .ico files.
I initially tried to use <code class="highlighter-rouge">exiftool</code> for extracting metadata, but then I discovered it does not handle .ico files, so I decided
to use a more powerful tool - ImageMagick’s <code class="highlighter-rouge">identify</code> - for this task. I was glad to see it had a handy <code class="highlighter-rouge">-format</code> option I could use
to select only the necessary fields (I found its <code class="highlighter-rouge">-verbose</code> output, well, too verbose for the task) and present them in a defined
form, removing the need to filter its output.</p>
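<p>A minimal sketch of how such a call could look from Python. The selected fields and the names <code class="highlighter-rouge">IDENTIFY_FORMAT</code>, <code class="highlighter-rouge">identify_command</code> and <code class="highlighter-rouge">image_metadata</code> are assumptions made for illustration, not the code that went into diffoscope:</p>

```python
import subprocess

# Format string for ImageMagick's `identify -format`: each %-escape is
# replaced with one image property (%m format, %w width, %h height,
# %z depth). The choice of fields is an assumption made for this sketch.
IDENTIFY_FORMAT = "Image format: %m\nWidth: %w\nHeight: %h\nDepth: %z\n"


def identify_command(path):
    """Build the argument list for extracting the selected fields."""
    return ["identify", "-format", IDENTIFY_FORMAT, path]


def image_metadata(path):
    """Run identify and return its output as text (needs ImageMagick)."""
    result = subprocess.run(
        identify_command(path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if identify fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

<p>Because the output already has a fixed shape, the two texts produced for two images can be handed to a plain text diff without any further filtering.</p>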
<p>What was particularly interesting and important for me in terms of learning: while working on this feature, I discovered that,
at that moment, diffoscope could not handle .ico files at all - the <code class="highlighter-rouge">img2txt</code> tool, which was used for retrieving image content, did
not support that type of image. But instead of recognizing this as a bug and resolving it, I started to think of a possible
workaround that would allow retrieving image metadata even after retrieving image content had failed.
Definitely not very good thinking. Thanks to Mattia Rizzolo for actually recognizing this as a bug and filing it,
and to Chris Lamb for fixing it!</p>
<h2 id="other-work">Other work</h2>
<h3 id="order-like-differences-part-2">Order-like differences, part 2</h3>
<p>In the previous post, I mentioned Lunar’s <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=848049#27">suggestion</a> to use
hashing for finding order-like differences in a wide variety of input data. I implemented that idea, but after a discussion with
my mentor, we decided it is probably not worth it - this change would alter quite a lot of things in the core modules of diffoscope,
and the gain would not be really significant.</p>
<p>Still, implementing that was an important experience for me, as I had to hack on the deepest and, arguably, most difficult
modules of diffoscope and gained some insight into how they work.</p>
<h3 id="comparing-with-several-tools-work-in-progress">Comparing with several tools (work in progress)</h3>
<p>Although my initial motivation for this idea was flawed (the workaround I mentioned earlier for .ico files), it still might be
useful to have a mechanism that would run several commands for finding differences and then give the output of those
that succeed, failing if and only if they all have failed.</p>
<p>One possible case where this might matter is when we use commands
coming from different tools, and one of them is not installed. It would be nice if we still used the other one and not the
uninformative binary diff (the default fallback for when something goes wrong with a more “clever” comparison).
I am still in the process of polishing this change, though, and still in doubt whether it is needed at all.</p>
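<p>The mechanism described above could be sketched roughly like this. It is only an illustration under simplifying assumptions: each tool is modelled as a plain Python callable, and <code class="highlighter-rouge">compare_with_tools</code> is a hypothetical name, not a diffoscope function:</p>

```python
def compare_with_tools(tools, path1, path2):
    """Run every tool and return the reports of those that succeed.

    `tools` is a list of (name, callable) pairs; each callable takes two
    paths and returns a text report, or raises an exception on failure.
    """
    reports = {}
    errors = {}
    for name, tool in tools:
        try:
            reports[name] = tool(path1, path2)
        except Exception as exc:  # for example, the tool is not installed
            errors[name] = exc
    if not reports:
        # Fail if and only if every single tool has failed.
        raise RuntimeError("all comparison tools failed: %r" % (errors,))
    return reports
```

<p>With this shape, a missing tool only removes its own report from the result; the caller would fall back to the binary diff only when the <code class="highlighter-rouge">RuntimeError</code> is raised.</p>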
<h2 id="side-note---outreachy-and-my-university-progress">Side note - Outreachy and my university progress</h2>
<p>In my Outreachy application, I promised that if I was selected for this round, I would do everything I could to free the
required time period from my university commitments. I did that by moving most of my courses to the first half of the
academic year. Now, the main thing that is left for me to do is my Master’s thesis.</p>
<p>I consulted my scientific advisors from both universities that I am formally attending (<a href="http://sfedu.ru/index_eng.php">SFEDU</a>
and <a href="http://www.lut.fi/web/en/">LUT</a> - I am in a double degree program), and as a result, they agreed to change my Master’s thesis
topic to match my Outreachy work.</p>
<p>Now, that should have sounded like excellent news - merging these activities together actually means I can allocate much more
time to my work on reproducible builds, even beyond the actual internship period.
That was intended to remove a burden from my shoulders.</p>
<p>Still, I feel a bit uneasy. The drawback of this decision lies in the fact that I have no idea how to write a scientific report based
on purely practical work. I know other students from my universities have done such things before, but choosing my own topic means
my scientific advisors can’t help me much - this is just outside their area of expertise.</p>
<p>Well, wish me luck - I’m up to the challenge!</p>
<h1>Getting to know diffoscope better</h1>
<p>2017-01-03, https://siamezzze.github.io/Getting-to-know-diffoscope-better</p>
<p>I apologize to all potential readers of this blog for not writing a comprehensive “Introduction” post with details of the project I am taking part in during my internship, as well as some story about how I ended up there.</p>
<p>Let me just say that I had been a Debian user for years when I discovered it was taking part in Outreachy as one of the organisations. Their <a href="https://reproducible-builds.org/">Reproducible Builds</a> effort has a noble goal and a bunch of great people behind it - I had no chance not to get excited by it. Looking for a place where my skills could be of any use, I discovered <strong>diffoscope</strong> - a tool for in-depth comparison of files, archives etc. My mentor, Mattia Rizzolo, supported my decision to work on it, so now I am concentrating my efforts on improving diffoscope.</p>
<p>As my first steps, I am doing the small (but hopefully still somewhat important) job of fixing existing bugs. It helps me better understand how diffoscope works and introduces me to the workflow of open-source development.</p>
<p>During December, I made several small contributions, mostly fixing bugs.</p>
<h3 id="test-data-and-jessie-backports">Test data and jessie-backports</h3>
<p>The first of them could be called cleaning up after my own mistake, although that mistake wasn’t trivial. During the application period, I fixed a bug with diffoscope failing while comparing symlinks to directories. That was a small change, but I included some tests for that case anyway.</p>
<p>…And that actually caused problems. With these tests, I included test data: two folders with symlinks. All was good in the unstable version of Debian, but in jessie-backports, that commit caused the build to fail. After some digging, I discovered the problem was caused by the build process copying that data. That was done using the <em>shutil</em> Python module, and the older version of that module included in jessie could not properly handle copying symlinks to directories.</p>
<p>Thanks to my mentor for giving me a hint on how to resolve this: using temporary folders and creating these symlinks at runtime. That way, we ensured the tests run without problems during the build process on jessie.</p>
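<p>A rough sketch of that approach, using only the Python standard library. The directory layout and the helper name <code class="highlighter-rouge">make_symlinked_dirs</code> are made up for illustration and do not reproduce the actual diffoscope test:</p>

```python
import os
import tempfile


def make_symlinked_dirs(base):
    """Create two directories, each holding a symlink to a subdirectory."""
    created = []
    for name, target in (("a", "target_one"), ("b", "target_two")):
        top = os.path.join(base, name)
        os.makedirs(os.path.join(top, target))
        link = os.path.join(top, "link")
        # Relative symlink pointing at a directory, made at test run time
        # so that nothing has to copy it during the package build.
        os.symlink(target, link)
        created.append(link)
    return created


# The temporary directory and everything in it is removed on exit.
with tempfile.TemporaryDirectory() as tmp:
    links = make_symlinked_dirs(tmp)
    targets = [os.readlink(link) for link in links]
```

<p>Since the symlinks exist only while the test runs, they never become part of the test data that has to be copied around.</p>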
<p><strong>What have I learned:</strong> A great deal, actually. I spent too much time on that one, but I learned how to build packages, what happens during a <em>dpkg-buildpackage</em> run and what the <em>debhelper</em> tools are for. I also learned a bit about what <em>chroot</em> is and how to use it for testing.</p>
<h3 id="icc-profile-files-and-file-type-recognizing-regexp">ICC profile files and file type recognizing regexp</h3>
<p>Another one was also about failing tests and, therefore, a failing build. The failing tests were all due to ICC files not being recognized by diffoscope. It turned out libmagic got an update which changed the description of ICC profile files. Diffoscope was relying on a regexp applied to the file type description to recognize the file, so I changed the regexp to reflect the changes in libmagic.</p>
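<p>To illustrate the kind of change involved, here is a small sketch. The two description strings and the pattern are hypothetical stand-ins; the real libmagic output and the real diffoscope regexp may look different:</p>

```python
import re

# Hypothetical descriptions, standing in for what libmagic might report
# before and after an update; the real strings may differ.
OLD_DESCRIPTION = "ICC Profile"
NEW_DESCRIPTION = "color profile 2.1, type lcms, RGB/XYZ-mntr device"

# Accept either wording, so that both libmagic versions are recognized.
RE_ICC = re.compile(r"\b(ICC Profile|color profile)\b")


def is_icc_profile(description):
    """Return True if a file type description looks like an ICC profile."""
    return RE_ICC.search(description) is not None
```

<p>The weakness is visible right in the sketch: any further rewording of the description means the pattern has to be extended again.</p>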
<p><strong>What have I learned:</strong> How diffoscope “recognizes” file types. It got me thinking: maybe there is a better way? The regexp-based approach is doomed to cause problems with every change to a file type description. This question is still lingering in my mind - maybe I will come up with an idea later.</p>
<h3 id="order-like-difference-in-text-files">Order-like difference in text files</h3>
<p>Next, I decided to do something a bit bigger and fulfilled a feature request. That request was for detecting order-like differences in text files (when the files have the same lines, but in a different order). I did it by collecting the “added” and “removed” lines of the <em>diff</em> output in lists, sorting them and then comparing them.</p>
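<p>The idea can be sketched in a few lines of Python. This is a simplified illustration that assumes the input is the body of a unified diff without the file header lines; <code class="highlighter-rouge">is_order_only_difference</code> is a hypothetical name, not the function in diffoscope:</p>

```python
def is_order_only_difference(diff_lines):
    """Check whether the hunks of a unified diff only reorder lines.

    `diff_lines` holds hunk lines only (no "---"/"+++" file headers):
    lines starting with "+" were added, lines starting with "-" removed.
    """
    added = []
    removed = []
    for line in diff_lines:
        if line.startswith("+"):
            added.append(line[1:])
        elif line.startswith("-"):
            removed.append(line[1:])
        # Context lines, "@@" markers and "\ No newline" notes are skipped.
    # Order-like difference: the same lines on both sides, in any order.
    return bool(added) and sorted(added) == sorted(removed)
```

<p>For example, the hunk <code class="highlighter-rouge">["-a", " b", "+a"]</code> moves the line <code class="highlighter-rouge">a</code> below <code class="highlighter-rouge">b</code>, so the added and removed lists are equal after sorting.</p>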
<p>Sadly, I forgot about one particular case - when one of the files is missing the newline at the end of the file. I was kindly reminded of that quite soon in comments on the bug tracker (thanks <em>danielsh</em>!) and have already fixed that.
I also received feedback on how to better implement it deeper in diffoscope - not using the results of diff, but rather comparing the sum of hashes of the lines directly in the difference module. I have yet to try that.</p>
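<p>A minimal sketch of that hash-based suggestion, assuming SHA-256 as the per-line hash (the choice of hash and the function names are assumptions for this illustration). Equal sums make an order-only difference overwhelmingly likely rather than strictly certain:</p>

```python
import hashlib


def line_hash_sum(lines):
    """Sum per-line hashes; the sum does not depend on line order."""
    total = 0
    for line in lines:
        digest = hashlib.sha256(line.encode("utf-8")).digest()
        total += int.from_bytes(digest, "big")
    return total


def same_lines_any_order(text1, text2):
    """Compare two texts by line count and by the sum of line hashes."""
    # splitlines() drops line endings, so a missing newline at the end
    # of one file does not change the result.
    lines1 = text1.splitlines()
    lines2 = text2.splitlines()
    return (len(lines1) == len(lines2)
            and line_hash_sum(lines1) == line_hash_sum(lines2))
```

<p>Unlike the sorting variant, this needs no lists of added and removed lines, only a running sum per file.</p>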
<p><strong>What have I learned:</strong> That a call to <em>diff</em> is actually the slowest part of a diffoscope run on two big text files. Could that knowledge help somehow in speeding it up? I don’t know yet.</p>
<p>I also learned to comment on bugs in the Debian bug tracker and was surprised by how much feedback I got. Thanks to my mentor for pushing me to do that - I definitely need to overcome my fear of communication to be more effective!</p>
<h3 id="random-ftbfs">Random FTBFS</h3>
<p>There was also a very nasty bug that caused diffoscope to <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=848403">randomly fail to build from source</a>, failing with the uninformative <em>Fatal Python error: deallocated None</em>. It already seemed strange when it was first reported; it got even stranger when that bug suddenly ceased to be reproducible. We hoped that would mean the bug was caused by some external tool and had been fixed there. It turns out it was not that easy. I tested this on two separate computers and on a virtual machine; I used different versions of diffoscope. Well. It seems that the bug is still somehow tied to the diffoscope version and not to some external tool version - I can still do git checkout 64 and reproduce the bug (still randomly, though).</p>
<p>Although I spent quite a lot of time on that one, the only result was the information about the connection between the bug’s appearances and the diffoscope version. I still wasn’t able to get to the root of the problem - hopefully, someone else will be able to, given the information I found.</p>
<p><strong>What have I learned:</strong> <em>git-bisect</em>! Thanks to my friend for pointing me to it; that tool came in handy in that situation. Also, I got some experience in catching nasty bugs like that (a pity that I got no experience in squashing them).</p>
<p>I had some extra time commitments in December, one of them (Reproducible Builds Summit II) connected to my internship and one (my exam session at university) not. In January, I should be able to allocate more time to this work - I hope it will help me achieve more significant results.</p>
<p>Many thanks to Mattia Rizzolo, Chris Lamb, Holger Levsen and all the other folks of the Reproducible Builds project - I cannot stress enough how important your support is to me.</p>
<p>Wish you all a great 2017!</p>
<h1>Reproducible Builds Summit 2016</h1>
<p>2016-12-16, https://siamezzze.github.io/Reproducible-Builds-Summit_2016</p>
<p>This is the second week of my internship period for the Reproducible Builds project of Debian, and it is very fortunate that exactly when I was starting to dig more into the project, I got the opportunity to take part in Reproducible Builds Summit II, taking place in Berlin. People from different projects who take part in the reproducible builds effort gathered here to discuss their vision of the project, share knowledge and brainstorm ideas that would benefit us all.</p>
<p>For me, it was an invaluable chance to learn a lot about the project and hear from people who had worked on it from the beginning and those who use it.</p>
<p>We spent most of the summit in sessions - talks, discussions or hacking sessions held in groups of 3-8 people. The topics discussed varied from the definition of reproducible builds to bootstrapping, from writing documentation to .buildinfo files. In every session, we discussed the important concepts, identified problems and looked for ways of resolving them.</p>
<p>One of the topics that interested me most was .buildinfo files, ways of collecting them and using them for reproducibility checks.
The idea behind .buildinfo files is the need to record all the useful information about a particular build process: what source files were used, what the compiler flags, environment variables etc. were during the build, as well as the hash of the resulting artifact(s). It is expected that comparison of these files will help to identify whether the software builds reproducibly and, if not, what affects the build.
As useful as they are, the use of .buildinfo files comes with several questions and ideas:</p>
<ul>
<li>
<p>To actually make them useful to users (or, most likely, developers), we will have to publish them somewhere and, ideally, integrate them into the packaging system. That would allow us to check if a package has reproducibility issues. But how do we ensure it? There were ideas like having “trusted builders” and checking whether there are any buildinfo files signed by them, or just accepting the package if it has enough “right” buildinfo files.</p>
</li>
<li>
<p>What should and should not be included in the .buildinfo files? On the one hand, we want them to record as little information as possible: only what is really necessary to reproduce the build. On the other hand, the reality is that there is still plenty of non-reproducible software, and that software can capture different things it should not really capture during the build: timestamp, build path, locale, environment variables. To effectively debug these cases, we will need much more info, so it is wise to include as much information as is safe to publish in the buildinfo files. Essentially, that boils down to whether the main purpose of buildinfo files is to ensure reproducibility or to debug unreproducibility.</p>
</li>
<li>
<p>Right of revocation. We are planning to publish buildinfo files, make assumptions based on them, and probably have some concept of “trusted builders”. But it is possible that we will need to make different statements about the state of a package. It can be something like “I was able to reproduce the software yesterday, but today the build process gives a different result” or “I submitted the .buildinfo, but then I found a problem with my test environment - please ignore my last submission”. Whatever the case, we cannot just remove the uploaded buildinfo file; instead, we should submit a new file with different results. But that situation has to be handled correctly by the system that uses these files for reproducibility checks (this one led to a really interesting discussion in the binary transparency session; we found that this problem resembles the certificate transparency issue).</p>
</li>
<li>
<p>We want to include everything that affects the build in buildinfo files, but there is an obvious limit to what information we can safely publish. In theory, sensitive information like the hostname or number of CPU cores should not affect the build, but in reality, that can happen. Should we make this information public and, if not, how do we handle its absence?</p>
</li>
</ul>
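<p>The basic idea of recording a build and comparing two records can be sketched as follows. This is a toy model in Python, not the actual .buildinfo format: the dictionary layout and the function names are assumptions made only to illustrate the concept:</p>

```python
import hashlib


def make_build_record(artifact_bytes, environment, sources):
    """Collect the facts about one build in a plain dictionary."""
    return {
        "sources": sorted(sources),
        "environment": dict(environment),
        "artifact_sha256": hashlib.sha256(artifact_bytes).hexdigest(),
    }


def compare_build_records(first, second):
    """Return (reproducible, differing_inputs) for two build records."""
    # The build is reproducible if both produced the same artifact hash.
    reproducible = first["artifact_sha256"] == second["artifact_sha256"]
    env1, env2 = first["environment"], second["environment"]
    # Environment variables that differ are candidates for what
    # affected the build when the hashes do not match.
    differing = sorted(
        key for key in set(env1) | set(env2)
        if env1.get(key) != env2.get(key)
    )
    return reproducible, differing
```

<p>The trade-off from the list above shows up directly in this model: the more keys the environment part records, the more useful the list of differing inputs is for debugging, and the more potentially sensitive data each record exposes.</p>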
<p>It is worth noting that .buildinfo files were actively discussed even in the sessions that were, at first glance, not connected to them. It shows how important the idea is and how many applications it can have.</p>
<p>Of course, I also paid a lot of attention to discussions around diffoscope, the tool I am going to focus my efforts on in the following several months. It got quite a lot of attention, and it was inspiring to see how useful it is to many different projects. There was a session in which we discussed the future of diffoscope, what people expect of it and what features should be added. That gave me several new ideas on what I should try to do to improve the tool. There was also quite a lot of progress made in between the sessions, during the hacking times. Sadly, I was not part of that progress, although now I am set on fulfilling the feature request for markdown output.</p>
<p>In general, I am very grateful to all the participants for being friendly and open to all sorts of questions and discussion. I often had to ask people to explain some concept that I did not know, and they always took the time to explain it to me. The community of the Reproducible Builds project is really great and welcoming!
Of course, that also reminded me of how little I still know. I wish I had come here more prepared and been able to add something to the effort. I do hope at least my questions were useful in fueling the discussion and seeing the problem from a different angle.</p>
<p>I tried to tell about what I saw and what I learned in a clear, meaningful way, but, in fact, I am just super-excited now, happy to have met all these great people and motivated to work on the project and be a part of the community. Everything about these three days was awesome, and all I saw here inspired me greatly.
My sincere thanks to all the organisers and participants of this event!</p>
<h1>Updated my journal</h1>
<p>2016-11-25, https://siamezzze.github.io/Updated_my_journal</p>
<p>Hello!
I am Maria Glukhova, a student from Lappeenranta University of Technology (LUT), studying intelligence computing.</p>
<p>I was selected to be an <a href="https://wiki.gnome.org/Outreachy/">Outreachy</a> intern for December-March this winter. I will work with the Debian Reproducible Builds team, improving existing tools that can be used to check the reproducibility of packages.</p>
<p>In this blog, I will post updates about how my internship is going.</p>