

Re: [DISCUSS] Unsustainable situation with ptests


Wow! Awesome. This is the 3rd time I remember seeing a green run in >4 yrs. :)

Thanks
Prasanth

> On May 15, 2018, at 5:28 PM, Jesus Camacho Rodriguez <jcamacho@xxxxxxxxxx> wrote:
> 
> We have just had the first clean run in a while:
> https://builds.apache.org/job/PreCommit-HIVE-Build/10971/testReport/
> 
> I will continue monitoring follow-up runs.
> 
> Thanks,
> -Jesús
> 
> 
> On 5/14/18, 11:28 PM, "Prasanth Jayachandran" <pjayachandran@xxxxxxxxxxxxxxx> wrote:
> 
>    Wondering if we can add a state transition from “Patch Available” to “Ready To Commit” which can only be triggered by ptest bot on green test run.
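For illustration only, a rough sketch of what such a bot hook could look like against the JIRA REST transitions endpoint. The transition id ("761"), the use of HIVE-19509 as the example key, and the comment text are assumptions for the sketch, not an existing Hive bot; a real "Patch Available" to "Ready To Commit" transition would first need to be configured in the ASF JIRA workflow.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch: build the JIRA REST call a ptest bot could issue after a green run.
// The request is only constructed and printed here, never sent.
public class TransitionSketch {
    // Assemble the JSON body for POST /rest/api/2/issue/{key}/transitions.
    static String transitionPayload(String transitionId, String comment) {
        return "{ \"transition\": { \"id\": \"" + transitionId + "\" },"
             + " \"update\": { \"comment\": [ { \"add\": { \"body\": \""
             + comment + "\" } } ] } }";
    }

    public static void main(String[] args) {
        // "761" is a made-up transition id for illustration.
        String body = transitionPayload("761",
                "Green ptest run, marking Ready To Commit");
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("https://issues.apache.org/jira/rest/api/2/issue/HIVE-19509/transitions"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        // A real bot would add credentials and send the request on a green run.
        System.out.println(req.method() + " " + req.uri());
        System.out.println(body);
    }
}
```
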
> 
>    Thanks
>    Prasanth
> 
> 
> 
>    On Mon, May 14, 2018 at 10:44 PM -0700, "Jesus Camacho Rodriguez" <jcamacho@xxxxxxxxxx> wrote:
> 
> 
>    I have been working on fixing this situation while commits were still coming in.
> 
>    All the tests that have been disabled are in:
>    https://issues.apache.org/jira/browse/HIVE-19509
>    I have created new issues to re-enable each of them; they are linked to that issue.
>    Maybe I was slightly aggressive disabling some of the tests, however that seemed to be the only way to bring the tests failures with age count > 1 to zero.
> 
>    Instead of starting a vote to freeze the commits in another thread, I will start a vote to be stricter wrt committing to master, i.e., only commit if we get a clean QA run.
> 
>    We can discuss more about this issue over there.
> 
>    Thanks,
>    Jesús
> 
> 
> 
>    On 5/14/18, 4:11 PM, "Sergey Shelukhin"  wrote:
> 
>        Can we please make this freeze conditional, i.e. we unfreeze automatically
>        after ptest is clean (as evidenced by the clean HiveQA run on a given
>        JIRA)?
> 
>        On 18/5/14, 15:16, "Alan Gates"  wrote:
> 
>> We should do it in a separate thread so that people can see it with the
>> [VOTE] subject.  Some people use that as a filter in their email to know
>> when to pay attention to things.
>> 
>> Alan.
>> 
>> On Mon, May 14, 2018 at 2:36 PM, Prasanth Jayachandran <
>> pjayachandran@xxxxxxxxxxxxxxx> wrote:
>> 
>>> Will there be a separate voting thread? Or the voting on this thread is
>>> sufficient for lock down?
>>> 
>>> Thanks
>>> Prasanth
>>> 
>>>> On May 14, 2018, at 2:34 PM, Alan Gates  wrote:
>>>> 
>>>> I see there's support for this, but people are still pouring in
>>> commits.
>>>> I propose we have a quick vote on this to lock down the commits
>>> until we
>>>> get to green.  That way everyone knows we have drawn the line at a
>>> specific
>>>> point.  Any commits after that point would be reverted.  There isn't a
>>>> category in the bylaws that fits this kind of vote but I suggest lazy
>>>> majority as the most appropriate one (at least 3 votes, more +1s than
>>>> -1s).
>>>> 
>>>> Alan.
>>>> 
>>>> On Mon, May 14, 2018 at 10:34 AM, Vihang Karajgaonkar <
>>> vihang@xxxxxxxxxxxx>
>>>> wrote:
>>>> 
>>>>> I worked on a few quick-fix optimizations in the Ptest infrastructure
>>>>> over the weekend which reduced the execution time from ~90 min to ~70
>>>>> min per run. I
>>>>> had to restart Ptest multiple times. I was resubmitting the patches
>>> which
>>>>> were in the queue manually, but I may have missed a few. In case you
>>> have a
>>>>> patch which is pending pre-commit and you don't see it in the queue,
>>> please
>>>>> submit it manually or let me know if you don't have access to the
>>> jenkins
>>>>> job. I will continue to work on the sub-tasks in HIVE-19425 and will
>>> do
>>>>> some maintenance next weekend as well.
>>>>> 
>>>>> On Mon, May 14, 2018 at 7:42 AM, Jesus Camacho Rodriguez <
>>>>> jcamacho@xxxxxxxxxx> wrote:
>>>>> 
>>>>>> Vineet has already been working on disabling those tests that were
>>>>>> timing out. I am working on disabling those that have been generating
>>>>>> different q files consistently for the last n ptest runs. I am keeping
>>>>>> track of all these tests in
>>>>>> https://issues.apache.org/jira/browse/HIVE-19509.
>>>>>> 
>>>>>> -Jesús
>>>>>> 
>>>>>> On 5/14/18, 2:25 AM, "Prasanth Jayachandran" <
>>>>>> pjayachandran@xxxxxxxxxxxxxxx> wrote:
>>>>>> 
>>>>>>   +1 on freezing commits until we get repeated green tests. We should
>>>>>>   probably disable (and remember in a jira to re-enable them at a later
>>>>>>   point) tests that are flaky, to get repeated green test runs.
>>>>>> 
>>>>>>   Thanks
>>>>>>   Prasanth
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>   On Mon, May 14, 2018 at 2:15 AM -0700, "Rui Li" <
>>>>> lirui.fudan@xxxxxxxxx
>>>>>>> wrote:
>>>>>> 
>>>>>> 
>>>>>>   +1 to freezing commits until we stabilize
>>>>>> 
>>>>>>   On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar
>>>>>>   wrote:
>>>>>> 
>>>>>>> In order to understand the end-to-end precommit flow I would like to
>>>>>>> get access to the PreCommit-HIVE-Build jenkins script. Does anyone
>>>>>>> know how I can get that?
>>>>>>> 
>>>>>>> On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
>>>>>>> jcamacho@xxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>>> Bq. For the short term green runs, I think we should @Ignore the tests
>>>>>>>> which have been failing for many runs. They are not being addressed
>>>>>>>> anyway. If people think they are important to run, we should fix them
>>>>>>>> and only then re-enable them.
>>>>>>>> 
>>>>>>>> I think that is a good idea, as we would minimize the time that we
>>>>>>>> halt development. We can create a JIRA where we list all tests that
>>>>>>>> were failing and have been disabled to get the clean run. From that
>>>>>>>> moment, we will have zero tolerance towards committing with failing
>>>>>>>> tests. And we need to pick up those tests that should not be ignored
>>>>>>>> and bring them back, but passing. If there is no disagreement, I can
>>>>>>>> start working on that.
>>>>>>>> 
>>>>>>>> Once I am done, I can try to help with infra tickets too.
>>>>>>>> 
>>>>>>>> -Jesús
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 5/11/18, 1:57 PM, "Vineet Garg"  wrote:
>>>>>>>> 
>>>>>>>>   +1. I strongly vote for freezing commits and getting our testing
>>>>>>>>   coverage into an acceptable state. We have been struggling to
>>>>>>>>   stabilize branch-3 due to test failures, and releasing Hive 3.0 in
>>>>>>>>   its current state would be unacceptable.
>>>>>>>> 
>>>>>>>>   Currently there are quite a few test suites which are not even
>>>>>>>>   running and are timing out. We have been committing patches (to both
>>>>>>>>   branch-3 and master) without test coverage for these tests. We should
>>>>>>>>   immediately figure out what’s going on before we proceed with
>>>>>>>>   commits.
>>>>>>>> 
>>>>>>>>   For reference following test suites are timing out on
>>>>> master: (
>>>>>>>> https://issues.apache.org/jira/browse/HIVE-19506)
>>>>>>>> 
>>>>>>>> 
>>>>>>>>   TestDbNotificationListener - did not produce a TEST-*.xml
>>>>> file
>>>>>>> (likely
>>>>>>>> timed out)
>>>>>>>> 
>>>>>>>>   TestHCatHiveCompatibility - did not produce a TEST-*.xml file
>>>>>> (likely
>>>>>>>> timed out)
>>>>>>>> 
>>>>>>>>   TestNegativeCliDriver - did not produce a TEST-*.xml file
>>>>>> (likely
>>>>>>>> timed out)
>>>>>>>> 
>>>>>>>>   TestNonCatCallsWithCatalog - did not produce a TEST-*.xml
>>>>> file
>>>>>>> (likely
>>>>>>>> timed out)
>>>>>>>> 
>>>>>>>>   TestSequenceFileReadWrite - did not produce a TEST-*.xml file
>>>>>> (likely
>>>>>>>> timed out)
>>>>>>>> 
>>>>>>>>   TestTxnExIm - did not produce a TEST-*.xml file (likely timed
>>>>>> out)
>>>>>>>> 
>>>>>>>> 
>>>>>>>>   Vineet
>>>>>>>> 
>>>>>>>> 
>>>>>>>>   On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <
>>>>>>> vihang@xxxxxxxxxxxx
>>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>   +1 There are many problems with the test infrastructure and in my
>>>>>>>>   opinion it has now become the number one bottleneck for the project.
>>>>>>>>   I was looking at the infrastructure yesterday and I think the current
>>>>>>>>   infrastructure (even with its own set of problems) is still
>>>>>>>>   under-utilized. I am planning to increase the number of threads that
>>>>>>>>   process the parallel test batches to start with. It needs a restart
>>>>>>>>   on the server side. I can do it now, if folks are okay with it. Else
>>>>>>>>   I can do it over the weekend when the queue is small.
>>>>>>>> 
>>>>>>>>   I listed the improvements which I thought would be useful under
>>>>>>>>   https://issues.apache.org/jira/browse/HIVE-19425 but frankly speaking
>>>>>>>>   I am not able to devote as much time as I would like to it. I would
>>>>>>>>   appreciate it if folks who have some more time can help out.
>>>>>>>> 
>>>>>>>>   I think to start with, https://issues.apache.org/jira/browse/HIVE-19429
>>>>>>>>   will help a lot. We need to pack more test runs in parallel, and
>>>>>>>>   containers provide good isolation.
>>>>>>>> 
>>>>>>>>   For the short term green runs, I think we should @Ignore the tests
>>>>>>>>   which have been failing for many runs. They are not being addressed
>>>>>>>>   anyway. If people think they are important to run, we should fix them
>>>>>>>>   and only then re-enable them.
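As a self-contained illustration of the @Ignore approach: JUnit's real org.junit.Ignore annotation is honored by its runner, but this sketch uses a stand-in annotation and a tiny reflective "runner" so it compiles and runs without JUnit on the classpath. The test names and the JIRA reference in the comment are illustrative.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Minimal sketch of how a runner skips @Ignore'd tests via reflection.
public class IgnoreSketch {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Ignore { String value() default ""; }

    // A flaky test disabled with a pointer to the re-enable JIRA,
    // mirroring what HIVE-19509's linked issues track.
    @Ignore("disabled pending HIVE-19509")
    public void testFlakyQuery() { throw new AssertionError("flaky"); }

    public void testStableQuery() { /* passes */ }

    public static void main(String[] args) throws Exception {
        for (Method m : IgnoreSketch.class.getDeclaredMethods()) {
            if (!m.getName().startsWith("test")) continue;
            Ignore ig = m.getAnnotation(Ignore.class);
            if (ig != null) {
                // Skipped tests are reported but never executed.
                System.out.println("SKIPPED " + m.getName() + " (" + ig.value() + ")");
            } else {
                m.invoke(new IgnoreSketch());
                System.out.println("PASSED " + m.getName());
            }
        }
    }
}
```

The same pattern is why an @Ignore'd test no longer counts against the failure total: the runner reports it as skipped instead of invoking it.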
>>>>>>>> 
>>>>>>>>   Also, I feel we need a light-weight test run which we can run
>>>>>>>>   locally before submitting for the full suite. That way minor issues
>>>>>>>>   with the patch can be handled locally. Maybe create a profile which
>>>>>>>>   runs a subset of important tests that are consistent. We could apply
>>>>>>>>   some label indicating the pre-checkin local tests ran successfully,
>>>>>>>>   and only then submit for the full suite.
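A hypothetical sketch of such a profile, assuming Surefire drives the subset. The profile id `quick-tests` and the include pattern are made up for illustration; this is not an existing Hive profile.

```xml
<!-- Illustrative Maven profile: run only a curated subset of stable
     tests locally before submitting the patch for a full ptest run. -->
<profile>
  <id>quick-tests</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <includes>
            <!-- placeholder name for the stable smoke-test subset -->
            <include>**/TestCliDriverSmoke.java</include>
          </includes>
        </configuration>
      </plugin>
    </plugins>
  </build>
</profile>
```

Developers could then run `mvn test -Pquick-tests` locally before queueing the patch for the full pre-commit run.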
>>>>>>>> 
>>>>>>>>   More thoughts are welcome. Thanks for starting this
>>>>>> conversation.
>>>>>>>> 
>>>>>>>>   On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
>>>>>>>>   jcamacho@xxxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>>   I believe we have reached a state (maybe we did reach it a
>>>>>> while ago)
>>>>>>>> that
>>>>>>>>   is not sustainable anymore, as there are so many tests
>>>>> failing
>>>>>> /
>>>>>>>> timing out
>>>>>>>>   that it is not possible to verify whether a patch is breaking
>>>>>> some
>>>>>>>> critical
>>>>>>>>   parts of the system or not. It also seems to me that due to
>>>>> the
>>>>>>>> timeouts
>>>>>>>>   (maybe due to infra, maybe not), ptest runs are taking even longer
>>>>>>>>   than usual, which in turn creates an even longer queue of patches.
>>>>>>>> 
>>>>>>>>   There is an ongoing effort to improve ptests usability (
>>>>>>>>   https://issues.apache.org/jira/browse/HIVE-19425), but apart
>>>>>> from
>>>>>>>> that,
>>>>>>>>   we need to make an effort to stabilize existing tests and
>>>>>> bring that
>>>>>>>>   failure count to zero.
>>>>>>>> 
>>>>>>>>   Hence, I am suggesting *we stop committing any patch before
>>>>> we
>>>>>> get a
>>>>>>>> green
>>>>>>>>   run*. If someone thinks this proposal is too radical, please
>>>>>> come up
>>>>>>>> with
>>>>>>>>   an alternative, because I do not think it is OK to have the
>>>>>> ptest
>>>>>>> runs
>>>>>>>> in
>>>>>>>>   their current state. Other projects of a certain size (e.g., Hadoop,
>>>>>>>>   Spark) are always green; we should be able to do the same.
>>>>>>>> 
>>>>>>>>   Finally, once we get to zero failures, I suggest we be less tolerant
>>>>>>>>   of committing without getting a clean ptest run. If there is a
>>>>>> failure,
>>>>>>>> we
>>>>>>>>   need to fix it or revert the patch that caused it, then we
>>>>>> continue
>>>>>>>>   developing.
>>>>>>>> 
>>>>>>>>   Please, let’s all work together as a community to fix this issue;
>>>>>>>>   that is the only way to get to zero quickly.
>>>>>>>> 
>>>>>>>>   Thanks,
>>>>>>>>   Jesús
>>>>>>>> 
>>>>>>>>   PS. I assume the flaky tests will come into the discussion.
>>>>>> Let's see
>>>>>>>>   first how many of those we have, then we can work to find a
>>>>>> fix.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>   --
>>>>>>   Best regards!
>>>>>>   Rui Li
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> 